Author Topic: How can a TSynMemo handle an UTF-8 string ?  (Read 4350 times)

Basile B.

  • Guest
How can a TSynMemo handle an UTF-8 string ?
« on: November 17, 2014, 01:58:24 pm »
I'm developing an IDE based on the SynEdit suite. Until a few days ago I was very happy with it, but I've realized that it doesn't handle some UTF-8 chars well.

For example, take the following string:

> "ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ"

Here is how it's displayed in an editor (default charset, monospaced font).

If I put the cursor before the first double quote:
> "ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ                            ",

If I put the cursor at the end of the line, after the comma:
> "ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ"                            ,

If I choose the OEM charset, the display is better, except that I only get square boxes instead of the characters.

> "□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□□",

Actually, I noticed that Lazarus itself has the same display issue. For example, play with:

> 'ຂອ້ຍກິນແກ້ວໄດ້ໂດຍທີ່ມັນບໍໄດ້ເຮັດໃຫ້ຂອ້ຍເຈັບ',

If you put the cursor before and then after the first char, you'll see that the final quote moves.

It's not a big problem, but I'd like to see if it can be fixed.

« Last Edit: November 17, 2014, 04:00:43 pm by Basile B. »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #1 on: November 17, 2014, 02:15:07 pm »
Yes, I also notice that the selection behaves very strangely when I try to select part of the string.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 7023
  • Debugger - SynEdit - and more
    • wiki
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #2 on: November 17, 2014, 02:29:29 pm »
This is an issue with the font, not a UTF-8 issue.

SynEdit can only deal with monospaced fonts.

However, I have noticed that some systems use proportional fonts for certain languages, even if the font should be monospaced.

Since SynEdit believes that the system would have drawn the text monospaced, SynEdit calculates position of the caret, the selection highlight, or the next char after a highlight change or RTL/LTR break based on the monospaced width of a char.

Setting "Extra Char Spacing" <> 0 will force strictly monospaced output (but it will not look good, and script languages will look broken on gtk/carbon).

As a result, you need to find a font that works for those letters. And if you distribute your app, so does every user.

To fix this in SynEdit, it needs support for proportional fonts. But that is still far away.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 7023
  • Debugger - SynEdit - and more
    • wiki
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #3 on: November 17, 2014, 02:34:52 pm »
EDIT: this will not work. See below

You can edit

components\synedit\lazsyntextarea.pp

line 1535 (trunk)
Code:
      // Prepare FETOBuf
      if FTextDrawer.NeedsEto or ATokenInfo.HasDoubleWidth
         {$IFDEF Windows} or ATokenInfo.RtlInfo.IsRtl {$ENDIF}  // RTL may have script with ligature
      then begin

Change the condition to "if true", or to any condition based on the text in the token.

On GTK and Carbon, this will break script based languages. Instead of drawing the letters as a continuous word, you will see each letter individually (as if separated by (invisible) spaces)
« Last Edit: November 17, 2014, 02:55:02 pm by Martin_fr »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 7023
  • Debugger - SynEdit - and more
    • wiki
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #4 on: November 17, 2014, 02:40:52 pm »
Actually, I might be wrong.

It may not be the font.

Some chars are made from several code points.
Quote
ອ້

It is made from 2 codepoints. SynEdit counts them wrong.

This clearly is a bug.

You can see this, when selecting text from left to right, one by one. The char becomes separated.
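To make the "2 codepoints" point concrete, here is a minimal Free Pascal sketch (not SynEdit code) that decodes a UTF-8 string and lists its code points. The helper name NextCodePoint and the constant Sample are made up for this example, and the decoder assumes valid UTF-8 with no error handling.

```pascal
program ListCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: decode the UTF-8 sequence that starts at byte
// index I (1-based) of S, return its code point and advance I past it.
// Assumes S holds valid UTF-8; malformed input is not detected.
function NextCodePoint(const S: string; var I: Integer): LongInt;
var
  B: Byte;
  Extra: Integer;
begin
  B := Ord(S[I]);
  if B < $80 then
  begin
    Result := B;             // 1-byte sequence (ASCII)
    Extra := 0;
  end
  else if (B and $E0) = $C0 then
  begin
    Result := B and $1F;     // lead byte of a 2-byte sequence
    Extra := 1;
  end
  else if (B and $F0) = $E0 then
  begin
    Result := B and $0F;     // lead byte of a 3-byte sequence
    Extra := 2;
  end
  else
  begin
    Result := B and $07;     // lead byte of a 4-byte sequence
    Extra := 3;
  end;
  Inc(I);
  while Extra > 0 do
  begin
    // Each continuation byte contributes its low 6 bits.
    Result := (Result shl 6) or (Ord(S[I]) and $3F);
    Inc(I);
    Dec(Extra);
  end;
end;

const
  // UTF-8 bytes of U+0EAD followed by U+0EC9 (the quoted char above),
  // written byte by byte so the source file encoding does not matter.
  Sample = #$E0#$BA#$AD#$E0#$BB#$89;
var
  S: string;
  I: Integer;
begin
  S := Sample;
  I := 1;
  while I <= Length(S) do
    WriteLn('U+', HexStr(NextCodePoint(S, I), 4));
end.
```

The loop should print two code points for what is displayed as a single char: the base letter and the tone mark.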


Yes, SynEdit does not handle those combined codepoints correctly yet.

What language is this?

This needs to be checked; maybe they need to be added to
components\synedit\synedittextbuffer.pp  line 891
Code:
function TSynEditStringList.LogicPosIsCombining(const AChar: PChar): Boolean;
« Last Edit: November 17, 2014, 02:43:09 pm by Martin_fr »

Bart

  • Hero Member
  • *****
  • Posts: 4225
    • Bart en Mariska's Webstek
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #5 on: November 17, 2014, 02:41:30 pm »
Quote
This is an issue with the font. Not an utf8 issue

SynEdit can only deal with mono spaced fonts.

However, I have noticed, that some systems use proportional fonts for certain languages, even if the font should be monospaced.

Hmm, I've seen this happen with Courier New as well.
For some of my tests that handle files with UTF-8 characters, I have resorted to using #-notation for the string constants; otherwise I run into the same problem as above.
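For readers unfamiliar with #-notation, a minimal sketch of what such a constant can look like in Free Pascal. The constant name LaoSample is hypothetical; the bytes are the UTF-8 encoding of U+0EAD (E0 BA AD) followed by U+0EC9 (E0 BB 89), two of the code points from the string in the first post.

```pascal
program HashNotation;
{$mode objfpc}{$H+}

const
  // Hypothetical test constant: each #$xx is one byte of the UTF-8
  // encoding, so the source file itself stays plain ASCII.
  LaoSample = #$E0#$BA#$AD#$E0#$BB#$89;
var
  S: string;
begin
  S := LaoSample;
  // Length counts bytes here, not displayed characters.
  WriteLn('Length in bytes: ', Length(S));
end.
```

Because the editor only ever shows ASCII for such a constant, the caret and selection problems described above do not appear while editing the test source.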

Bart

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 7023
  • Debugger - SynEdit - and more
    • wiki
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #6 on: November 17, 2014, 03:00:04 pm »
Some of the chars are not supported by SynEdit. SynEdit does not yet use the OS to handle Unicode correctly, and it only implements a small subset.

The text contains
 'ZERO WIDTH NO-BREAK SPACE'
which I am not sure SynEdit will deal with.

It also contains 'LAO TONE MAI THO', which is a Non-Spacing Mark, but SynEdit will allocate space for it.

I don't know the full Unicode spec on this, so I do not know when it will be rendered which way.

You can try to add it to combining marks. It is not the correct fix, but if you are lucky it will work.
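As a rough idea of what such a check could look like, here is a minimal standalone sketch. It is not SynEdit's actual code and does not show how LogicPosIsCombining is implemented; the helper name IsLaoToneMaiTho is made up for this example. It only tests for the UTF-8 bytes of U+0EC9 (LAO TONE MAI THO), which are E0 BB 89.

```pascal
program CombiningSketch;
{$mode objfpc}{$H+}

// Hypothetical helper, not part of SynEdit: True when the UTF-8
// sequence at AChar encodes U+0EC9 (LAO TONE MAI THO).
// Assumes a null-terminated buffer; with the default short-circuit
// boolean evaluation, reading stops at the first mismatching byte.
function IsLaoToneMaiTho(const AChar: PChar): Boolean;
begin
  Result := (AChar[0] = #$E0) and (AChar[1] = #$BB) and (AChar[2] = #$89);
end;

const
  // U+0EAD (base letter) followed by U+0EC9 (tone mark), as UTF-8 bytes.
  Sample = #$E0#$BA#$AD#$E0#$BB#$89;
var
  S: string;
begin
  S := Sample;
  WriteLn(IsLaoToneMaiTho(PChar(S)));      // position of the base letter
  WriteLn(IsLaoToneMaiTho(PChar(S) + 3));  // position of the tone mark
end.
```

A real fix would need to cover the other non-spacing marks as well, which is why this is only a workaround-style sketch.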

Basile B.

  • Guest
Re: How can a TSynMemo handle an UTF-8 string ?
« Reply #7 on: November 17, 2014, 04:00:06 pm »
OK, so this is not an error on my side. Good to know.
I'll let someone else file a bug report if needed.

 
