
Author Topic: 4-bytes characters bug!  (Read 1273 times)

Ed78z

  • New Member
  • *
  • Posts: 45
4-bytes characters bug!
« on: July 01, 2025, 10:08:33 pm »
I have an issue with all TCustomEdit components (TEdit, TMemo, TRichMemo, ...).

These components work fine with all 1-, 2- and 3-byte characters; the issue only affects 4-byte characters (𝑒 𝜵 𝒾 𝝏 𝒿 ...).

They count each 4-byte character as 2 characters. For example, the string '12💓💔' has 4 characters (2 x 1-byte + 2 x 4-byte), but the caret position and the SelStart property report 6 (not 4). This causes problems when inserting a string or a char at any position after the first 4-byte character in the string (in this example, positions 3 or 4).

Does anyone know how to fix this issue?
Here is test code to reproduce it quickly (add a Button and a Memo to the Form):

Code: Pascal
implementation

{$R *.lfm}

{ TForm1 }

Uses LazUTF8;

Procedure InsChar(Var M: TMemo; C: String);
var
  P, Len: integer;
  TextBefore, TextAfter: string;
begin
  Len := UTF8LengthFast(C);
  P := M.SelStart; // <-- this is the issue
  TextBefore := UTF8Copy(M.Text, 1, P);
  TextAfter := UTF8Copy(M.Text, P + 1, UTF8LengthFast(M.Text) - P);
  M.Text := TextBefore + C + TextAfter;
  M.SelStart := P + Len;
  M.SelLength := 0;
  M.SetFocus;
end;


procedure TForm1.FormCreate(Sender: TObject);
begin
  Memo1.WordWrap:=True;
  Memo1.text:='12💓💔💕💖💗💘💛💜💝💞💟34';
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  //insert at the current caret position
  InsChar(Memo1,'💰😕');
end;

end.

Memo1.text:='12💓💔💕💖💗💘💛💜💝💞💟34';
InsChar(Memo1,'💰😕');


Lazarus 4.0
Windows 11 x64

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 11476
  • Debugger - SynEdit - and more
    • wiki
Re: 4-bytes characters bug!
« Reply #1 on: July 01, 2025, 10:16:41 pm »
Just taking a guess here.

Windows? So TMemo as a Windows control will be given Utf16 (not a problem, easy to translate).

But in UTF-16 this character
https://www.fileformat.info/info/unicode/char/1f493/index.htm
uses a surrogate pair.
In other words, it takes 2 code units (2 words) to represent the char. The Windows API then reports positions in those code units.
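
As a worked example of that surrogate pair, here is a minimal sketch (not from the thread; the program and variable names are made up for illustration) that splits U+1F493 into its two UTF-16 code units using the standard UTF-16 encoding rule:

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
const
  CodePoint = $1F493; // U+1F493, the heart used in this thread
var
  V, HighUnit, LowUnit: Integer;
begin
  // Code points above U+FFFF are reduced by $10000, leaving a 20-bit value.
  V := CodePoint - $10000;
  // The upper 10 bits go into the high surrogate, the lower 10 into the low one.
  HighUnit := $D800 + (V shr 10);
  LowUnit := $DC00 + (V and $3FF);
  WriteLn(IntToHex(HighUnit, 4), ' ', IntToHex(LowUnit, 4)); // D83D DC93
end.
```

One visible character therefore occupies two positions in any index that counts UTF-16 code units.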

It appears that this is not taken into account when the LCL/WS code translates the positions back to UTF-8 positions.

If I am guessing right, it's a bug in the LCL.




You say the caret reports 6 => is that when you read it from code?

If you focus the memo and use the cursor left/right keys, does the caret stop at 4 positions (actually 5: front, back, and 3 in between)? Are they just not reported as continuous positions?
« Last Edit: July 01, 2025, 10:19:09 pm by Martin_fr »

Ed78z

Re: 4-bytes characters bug!
« Reply #2 on: July 01, 2025, 10:31:17 pm »
Yes,
The caret or selstart reports wrong.
Assume '12💓💔';
Starting from left it returns 1,2,4,6 instead of 1,2,3,4

ASerge

  • Hero Member
  • *****
  • Posts: 2443
Re: 4-bytes characters bug!
« Reply #3 on: July 02, 2025, 02:38:50 am »
Quote
Yes,
The caret or selstart reports wrong.
Assume '12💓💔';
Starting from left it returns 1,2,4,6 instead of 1,2,3,4

There's nothing wrong with that.
WinAPI counts code units, not code points.
Let's go from left to right:
SelStart=0
'1', +1 code unit, SelStart=1
'2', +1 code unit, SelStart=2
'💓', +2 code units (1 code point), SelStart=4
'💔', +2 code units (1 code point), SelStart=6
Moreover, you can place SelStart inside a code point from code: SelStart := 3. But it will look strange.
When typing directly, the memo does not allow this and moves the cursor only between code point positions.

Edit:
When you type text directly, the memo moves the cursor glyph by glyph. Several code points can be combined into one glyph (visible character).
With 3 code points of 4 bytes each, that is 6 code units (SelStart steps) or 12 bytes for a single character!
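
The three ways of counting can be compared in a small console program. This is only a sketch: it assumes the LazUtils package (for the LazUTF8 unit) is available to the project, and the string is written as explicit UTF-8 bytes so the result does not depend on the source file encoding.

```pascal
program CountUnits;
{$mode objfpc}{$H+}
uses
  LazUTF8;
var
  S: string;
  W: UnicodeString;
begin
  // '12💓💔' spelled out as UTF-8 bytes (F0 9F 92 93 and F0 9F 92 94).
  S := '12'#$F0#$9F#$92#$93#$F0#$9F#$92#$94;
  W := UTF8ToUTF16(S);
  WriteLn('UTF-8 bytes:       ', Length(S));     // 10
  WriteLn('Code points:       ', UTF8Length(S)); // 4
  WriteLn('UTF-16 code units: ', Length(W));     // 6
end.
```

The last number is the one that matches the SelStart values listed above.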
« Last Edit: July 02, 2025, 02:49:03 am by ASerge »

Ed78z

Re: 4-bytes characters bug!
« Reply #4 on: July 02, 2025, 04:11:46 am »
Thank you ASerge,
The issue is: I get the current caret position from SelStart in order to insert a char or a string at that position.
Assume the caret (|) is after the third character: '12💓|💔'. SelStart returns 4 instead of 3, which causes a problem when I copy with UTF8Copy, since that function works correctly and handles multibyte characters internally.
So, inserting '💰😕' at position 3 of '12💓💔' results in '12💓💔💰😕', because SelStart returned 4, not 3.

If you check the test code that I provided earlier, you will understand what I am saying exactly.

Thanks and I really appreciate your help!

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1538
    • Lebeau Software
Re: 4-bytes characters bug!
« Reply #5 on: July 02, 2025, 07:08:52 am »
Quote
The issue is: I get the current caret position from SelStart in order to insert a char or a string at that position.
Assume the caret (|) is after the third character: '12💓|💔'. SelStart returns 4 instead of 3, which causes a problem when I copy with UTF8Copy, since that function works correctly and handles multibyte characters internally.
So, inserting '💰😕' at position 3 of '12💓💔' results in '12💓💔💰😕', because SelStart returned 4, not 3.

The problem is you are not taking into account that the UI controls on Windows use a different character encoding than your strings are using. Your strings are using UTF-8, but on Windows your UI controls do not use UTF-8, they use ANSI or UTF-16, depending on how you create them. You need to take the UI encoding into account when translating characters/indexes between your UI and your strings. It is not a 1:1 mapping.

In any case, you have over-complicated your InsChar() function. Let the UI do the hard work for you. Use the SelText property to replace the current selection, eg:

Code: Pascal
Procedure InsChar(M: TMemo; C: String);
var
  P, Len: integer;
begin
  P := M.SelStart;
  // calculate the length of the text that is not selected
  Len := M.GetTextLen() - M.SelLength;
  // replace the current selection
  M.SelText := C;
  // calculate how many characters were actually inserted
  // into the selection and advance the selection start
  M.SelStart := P + (M.GetTextLen() - Len);
  M.SelLength := 0;
  M.SetFocus;
end;
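
If a code point index is still needed, for example to keep using UTF8Copy, the offset can be translated explicitly. The following is a minimal sketch under the assumption discussed in this thread, namely that SelStart on Windows counts UTF-16 code units. Utf16OffsetToCodePoints and CodePointsToUtf16Offset are hypothetical helper names, not LCL functions, and the LazUTF8 unit from LazUtils is assumed to be available.

```pascal
program SelStartOffsets;
{$mode objfpc}{$H+}
uses
  LazUTF8;

// Hypothetical helper: how many code points lie before a UTF-16 offset.
function Utf16OffsetToCodePoints(const AText: string; AOffset: Integer): Integer;
var
  W: UnicodeString;
  I: Integer;
begin
  W := UTF8ToUTF16(AText);
  if AOffset > Length(W) then
    AOffset := Length(W);
  Result := 0;
  for I := 1 to AOffset do
    // A low surrogate is the second half of a pair, so it is not counted.
    if (Ord(W[I]) < $DC00) or (Ord(W[I]) > $DFFF) then
      Inc(Result);
end;

// Hypothetical helper: the reverse direction, for writing SelStart back.
function CodePointsToUtf16Offset(const AText: string; ACount: Integer): Integer;
begin
  Result := Length(UTF8ToUTF16(UTF8Copy(AText, 1, ACount)));
end;

var
  S: string;
begin
  // '12💓💔' spelled out as UTF-8 bytes.
  S := '12'#$F0#$9F#$92#$93#$F0#$9F#$92#$94;
  WriteLn(Utf16OffsetToCodePoints(S, 4)); // 3
  WriteLn(CodePointsToUtf16Offset(S, 3)); // 4
end.
```

On widgetsets where SelStart already counts code points (as reported for Linux later in this thread), no translation would be needed, so a real implementation would have to be platform-aware.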
« Last Edit: July 02, 2025, 06:40:27 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Martin_fr

Re: 4-bytes characters bug!
« Reply #6 on: July 02, 2025, 10:25:21 am »
I'd still say the LCL may need review.

1) On Linux it seems to return positions in UTF-8 code points (combining not tested), and the heart advances it by ONE.

2) The LCL converts between UTF-8 and UTF-16. A position in UTF-16 code units is completely useless.

paule32

  • Hero Member
  • *****
  • Posts: 603
  • One in all. But, not all in one.
Re: 4-bytes characters bug!
« Reply #7 on: July 02, 2025, 12:58:55 pm »
I agree with Remy.
Windows use 16 Bit Encoding UTF-16.
this means, you have 4 Bytes for each Character/Letter in a String.
UTF-8 will be supported - that consume only 8 Bit => 2 Bytes for each Letter in a String.
ANSI will supported, too (also 4 Bit => 1 Bytes => 255 Characters)

On this fact, it give different Windows API Function's

They are splitted into A and W Function's like MessageBoxA or MessageBoxW.

A stands for Ansi (255 Letters for each Byte)
W stands for WideChar/UTF-8 and UTF-16
« Last Edit: July 02, 2025, 01:01:15 pm by paule32 »
MS-IIS - Internet Information Server, Apache, PHP/HTML/CSS, MinGW-32/64 MSys2 GNU C/C++ 13 (-stdc++20), FPC 3.2.2
A Friend in need, is a Friend indeed.

cdbc

  • Hero Member
  • *****
  • Posts: 2264
    • http://www.cdbc.dk
Re: 4-bytes characters bug!
« Reply #8 on: July 02, 2025, 01:45:16 pm »
Hi
@paule32: You really shouldn't talk about things, that you do not understand:
Quote
Windows use 16 Bit Encoding UTF-16.
this means, you have 4 Bytes for each Character/Letter in a String.
UTF-8 will be supported - that consume only 8 Bit => 2 Bytes for each Letter in a String.
ANSI will supported, too (also 4 Bit => 1 Bytes => 255 Characters)

On this fact, it give different Windows API Function's

They are splitted into A and W Function's like MessageBoxA or MessageBoxW.

A stands for Ansi (255 Letters for each Byte)
W stands for WideChar/UTF-8 and UTF-16
I really hope, that no 'noob's read the above quote... PLEASE GO STUDY STRINGS SUBJECT! paule!!!
Since when is 4 bits = 1 byte?!?
The difference between 'Ansi' and 'Ascii'?!?
My condolences paule - Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 3.6 up until Jan 2024 from then on it's both above &: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 4.99

paule32

Re: 4-bytes characters bug!
« Reply #9 on: July 02, 2025, 02:16:06 pm »
yes, shame on my head...  :o

I was squirel a little bit.

The original ANSI letters run from 0 to 127.
The extended ANSI letters end at 255.

So you have two groups of 4 bits (4 for the original and 4 for the extended character set).
0 0 0 0 - 1 1 1 1 = 8 Bits = 1 Byte
16        *      16 = 256 - 1
« Last Edit: July 02, 2025, 02:20:35 pm by paule32 »

Thaddy

  • Hero Member
  • *****
  • Posts: 17451
  • Ceterum censeo Trumpum esse delendum (Tnx Charlie)
Re: 4-bytes characters bug!
« Reply #10 on: July 02, 2025, 02:23:30 pm »
ASCII is 7 bit, not 8. But it is expanded to 8 in normal code.
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

paule32

Re: 4-bytes characters bug!
« Reply #11 on: July 02, 2025, 02:29:38 pm »
Yes, 7 bits.
But the CPU calculates in 8 bits, so in ANSI you have 1 bit left over, but nobody noticed.

The 7 bits are an artifact of the first e-mail between two universities.
People thought that 7 bits were enough for teletype machines...

But some months later, the global players needed more comfort...
And it will never stop...

PascalDragon

  • Hero Member
  • *****
  • Posts: 6049
  • Compiler Developer
Re: 4-bytes characters bug!
« Reply #12 on: July 03, 2025, 09:21:14 pm »
Quote
The problem is you are not taking into account that the UI controls on Windows use a different character encoding than your strings are using. Your strings are using UTF-8, but on Windows your UI controls do not use UTF-8, they use ANSI or UTF-16, depending on how you create them. You need to take the UI encoding into account when translating characters/indexes between your UI and your strings. It is not a 1:1 mapping.

Well, with current Windows versions (starting with Windows 10 version 1903) you can enable UTF-8 as the encoding for the A API through the application manifest. Lazarus provides a switch for this in the manifest settings.

Ed78z

Re: 4-bytes characters bug!
« Reply #13 on: July 03, 2025, 10:05:08 pm »
Quote
I'd still say the LCL may need review.

1) On Linux it seems to return positions in UTF-8 code points (combining not tested), and the heart advances it by ONE.

2) The LCL converts between UTF-8 and UTF-16. A position in UTF-16 code units is completely useless.

Yes, you are right, all the TCustomEdit components need to be fixed. For example, in a TRichMemo, RichMemo1.SelectAll (a built-in method) won't select all of the text in RichMemo1 when RichMemo1.Text contains a bunch of 4-byte characters.

Also, there is no way to get the index of TCustomEdit.Lines[index].text from the CaretPos.Y or SelStart values when WordWrap=True and the line is very long.

Thaddy

Re: 4-bytes characters bug!
« Reply #14 on: July 04, 2025, 06:18:44 pm »
RichMemo is not part of the LCL.

 
