
Author Topic: 4-bytes characters bug!  (Read 2152 times)

Ed78z

  • Jr. Member
  • **
  • Posts: 55
4-bytes characters bug!
« on: July 01, 2025, 10:08:33 pm »
I have an issue with all TCustomEdit descendants (TEdit, TMemo, TRichMemo, ...).

These components work fine with all 1-, 2- and 3-byte characters; the issue occurs only with 4-byte characters (𝑒 𝜵 𝒾 𝝏 𝒿 ...).

They count each 4-byte character as 2 characters. For example, the string '12💓💔' has 4 characters (2 x 1-byte + 2 x 4-byte), yet the caret position and the SelStart property report 6 (not 4). This causes problems when inserting a string or a char at any position after the first 4-byte character in the string (positions 3 and 4 in this example).

Does anyone know how to fix this issue?
Here is test code to help you reproduce it quickly (add a Button and a Memo to the Form):

Code: Pascal
implementation

{$R *.lfm}

{ TForm1 }

Uses LazUTF8;

Procedure InsChar(Var M: TMemo; C: String);
var
  P, Len: integer;
  TextBefore, TextAfter: string;
begin
  Len := UTF8LengthFast(C);
  P := M.SelStart; // <-- this is the issue
  TextBefore := UTF8Copy(M.Text, 1, P);
  TextAfter := UTF8Copy(M.Text, P + 1, UTF8LengthFast(M.Text) - P);
  M.Text := TextBefore + C + TextAfter;
  M.SelStart := P + Len;
  M.SelLength := 0;
  M.SetFocus;
end;


procedure TForm1.FormCreate(Sender: TObject);
begin
  Memo1.WordWrap:=True;
  Memo1.text:='12💓💔💕💖💗💘💛💜💝💞💟34';
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  //insert at the current caret position
  InsChar(Memo1,'💰😕');
end;

end.


Lazarus 4.0
Windows 11 x64

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12423
  • Debugger - SynEdit - and more
    • wiki
Re: 4-bytes characters bug!
« Reply #1 on: July 01, 2025, 10:16:41 pm »
Just taking a guess here.

Windows? Then TMemo, as a native Windows control, is given UTF-16 (not a problem, easy to translate).

But in UTF-16 this character
https://www.fileformat.info/info/unicode/char/1f493/index.htm
uses a surrogate pair.
In other words, it takes 2 code units (2 words) to represent the char. The Windows API will then report positions in those units.
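
The surrogate-pair arithmetic for that character can be sketched as a small Free Pascal console program. `ToSurrogates` is a hypothetical helper written for this illustration, not an LCL or RTL routine; it only applies the standard UTF-16 rule for code points above U+FFFF.

```pascal
program SurrogatePairDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into the two UTF-16 code units (high and low surrogate) that encode it.
procedure ToSurrogates(CodePoint: Integer; out HighUnit, LowUnit: Integer);
var
  V: Integer;
begin
  V := CodePoint - $10000;            // 20 significant bits remain
  HighUnit := $D800 + (V shr 10);     // top 10 bits go into the high surrogate
  LowUnit  := $DC00 + (V and $3FF);   // bottom 10 bits go into the low surrogate
end;

var
  H, L: Integer;
begin
  ToSurrogates($1F493, H, L);         // U+1F493 is the heart from the linked page
  WriteLn(IntToHex(H, 4), ' ', IntToHex(L, 4));  // prints "D83D DC93"
end.
```

One code point therefore occupies two 16-bit words, which is why a position counted in words moves by 2 when it passes this character.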

It appears that this is not taken into account when the LCL/WS code translates the positions back to UTF-8 positions.

If I am guessing right, it's a bug in the LCL.




You say the caret reports 6 => is that when you read it from code?

If you focus the memo and use the cursor left/right keys, does the caret then stop at 4 positions (actually 5: front, back, and 3 in between)? And they are just not reported as consecutive positions?
« Last Edit: July 01, 2025, 10:19:09 pm by Martin_fr »

ASerge

  • Hero Member
  • *****
  • Posts: 2498
Re: 4-bytes characters bug!
« Reply #2 on: July 02, 2025, 02:38:50 am »
Quote
Yes,
the caret or SelStart reports the wrong value.
Assume '12💓💔';
starting from the left it returns 1,2,4,6 instead of 1,2,3,4.

There's nothing wrong with that.
WinAPI counts in UTF-16 code units, not code points.
Let's go from left to right:
SelStart=0
'1', +1 code unit, SelStart=1
'2', +1 code unit, SelStart=2
'💓', +2 code units (1 code point), SelStart=4
'💔', +2 code units (1 code point), SelStart=6
Moreover, you can place SelStart inside a code point programmatically: SelStart := 3. But it will look strange.
When typing directly, the memo does not allow this and moves the cursor only between code point positions.

Edit:
When you type text directly, the memo moves the cursor glyph by glyph. Several code points can be combined into one glyph (visible character).
With 3 four-byte code points, for example, that is 6 code units (SelStart steps), or 12 UTF-8 bytes, for one character!
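
The difference between the three ways of counting can be checked with a short console program. This is only a sketch: `Utf16OffsetToCodePoints` is a hypothetical helper name, and the test string is built from explicit surrogate values so that the result does not depend on the source file encoding. `UTF8Encode` is the Free Pascal RTL routine.

```pascal
program CodeUnitsVsCodePoints;
{$mode objfpc}{$H+}

// Hypothetical helper: count how many code points lie before a given
// UTF-16 code unit offset (the kind of value SelStart reports on Windows).
// An offset that falls inside a surrogate pair counts that pair as well.
function Utf16OffsetToCodePoints(const W: UnicodeString; UnitOffset: Integer): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while (I <= UnitOffset) and (I <= Length(W)) do
  begin
    // high surrogate followed by low surrogate: one code point, two code units
    if (I < Length(W)) and
       (Ord(W[I]) >= $D800) and (Ord(W[I]) <= $DBFF) and
       (Ord(W[I + 1]) >= $DC00) and (Ord(W[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  W: UnicodeString;
begin
  // '12' followed by U+1F493 and U+1F494, written as explicit surrogate pairs
  W := '12';
  W := W + WideChar($D83D) + WideChar($DC93);
  W := W + WideChar($D83D) + WideChar($DC94);

  WriteLn(Length(W));                       // 6 code units
  WriteLn(Length(UTF8Encode(W)));           // 10 UTF-8 bytes
  WriteLn(Utf16OffsetToCodePoints(W, 4));   // 3 code points before offset 4
  WriteLn(Utf16OffsetToCodePoints(W, 6));   // 4 code points before offset 6
end.
```

Note that this counts code points, not glyphs; combined sequences would still need grapheme-aware handling.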
« Last Edit: July 02, 2025, 02:49:03 am by ASerge »

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 1595
    • Lebeau Software
Re: 4-bytes characters bug!
« Reply #3 on: July 02, 2025, 07:08:52 am »
Quote
The issue is: I am getting the current caret position from SelStart in order to insert a char or a string at that position.
Assume the caret (|) is after the third character: '12💓|💔'. SelStart returns 4 instead of 3, which causes a problem when I copy with UTF8Copy, since that function works correctly and takes care of multibyte characters internally.
So inserting '💰😕' at position 3 of '12💓💔' results in '12💓💔💰😕', because SelStart returned 4, not 3.

The problem is you are not taking into account that the UI controls on Windows use a different character encoding than your strings are using. Your strings are using UTF-8, but on Windows your UI controls do not use UTF-8, they use ANSI or UTF-16, depending on how you create them. You need to take the UI encoding into account when translating characters/indexes between your UI and your strings. It is not a 1:1 mapping.
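
One way to take the UI encoding into account, sketched here outside any form code, is to do the edit in UTF-16 and convert back afterwards. This assumes the offset really is reported in UTF-16 code units, as described earlier in the thread for Windows. `InsertAtUtf16Offset` and `Pair` are hypothetical helpers for illustration; `UTF8Decode`, `UTF8Encode` and `Insert` are Free Pascal RTL routines.

```pascal
program InsertAtUiOffset;
{$mode objfpc}{$H+}

// Hypothetical helper: insert Utf8Ins into Utf8Text at a position given in
// UTF-16 code units, by editing a UTF-16 copy and converting back to UTF-8.
function InsertAtUtf16Offset(const Utf8Text, Utf8Ins: RawByteString;
  UnitOffset: Integer): RawByteString;
var
  W: UnicodeString;
begin
  W := UTF8Decode(Utf8Text);
  // clamp the offset to the valid range
  if UnitOffset < 0 then UnitOffset := 0;
  if UnitOffset > Length(W) then UnitOffset := Length(W);
  Insert(UTF8Decode(Utf8Ins), W, UnitOffset + 1);  // Insert uses 1-based indexes
  Result := UTF8Encode(W);
end;

// Hypothetical helper: build a one-character string from a surrogate pair,
// so the demo does not depend on the source file encoding.
function Pair(HighUnit, LowUnit: Word): UnicodeString;
begin
  SetLength(Result, 2);
  Result[1] := WideChar(HighUnit);
  Result[2] := WideChar(LowUnit);
end;

var
  Prefix, Heart, Broken, Money: UnicodeString;
  Got, Expected: RawByteString;
begin
  Prefix := '12';
  Heart  := Pair($D83D, $DC93);   // U+1F493
  Broken := Pair($D83D, $DC94);   // U+1F494
  Money  := Pair($D83D, $DCB0);   // U+1F4B0

  // Caret between the two hearts: 4 code units, although only 3 characters
  Got := InsertAtUtf16Offset(UTF8Encode(Prefix + Heart + Broken),
                             UTF8Encode(Money), 4);
  Expected := UTF8Encode(Prefix + Heart + Money + Broken);
  WriteLn(Got = Expected);        // prints TRUE
end.
```

The same offset passed to a code-point-based routine such as UTF8Copy would land one character too far, which matches the behaviour reported in the first post.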

In any case, you have over-complicated your InsChar() function. Let the UI do the hard work for you. Use the SelText property to replace the current selection, eg:

Code: Pascal
Procedure InsChar(M: TMemo; C: String);
var
  P, Len: integer;
begin
  P := M.SelStart;
  // calculate the length of the text that is not selected
  Len := M.GetTextLen() - M.SelLength;
  // replace the current selection
  M.SelText := C;
  // calculate how many characters were actually inserted
  // into the selection and advance the selection start
  M.SelStart := P + (M.GetTextLen() - Len);
  M.SelLength := 0;
  M.SetFocus;
end;
« Last Edit: July 02, 2025, 06:40:27 pm by Remy Lebeau »
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12423
  • Debugger - SynEdit - and more
    • wiki
Re: 4-bytes characters bug!
« Reply #4 on: July 02, 2025, 10:25:21 am »
I'd still say the LCL may need review.

1) On Linux it seems to report positions in UTF-8 code points (combining characters not tested): the heart advances the position by ONE.

2) The LCL converts between UTF-8 and UTF-16. A position counted in UTF-16 code units is completely useless to code that works with UTF-8 strings.

paule32

  • Hero Member
  • *****
  • Posts: 647
  • One in all. But, not all in one.
Re: 4-bytes characters bug!
« Reply #5 on: July 02, 2025, 12:58:55 pm »
I agree with Remy.
Windows use 16 Bit Encoding UTF-16.
this means, you have 4 Bytes for each Character/Letter in a String.
UTF-8 will be supported - that consume only 8 Bit => 2 Bytes for each Letter in a String.
ANSI will supported, too (also 4 Bit => 1 Bytes => 255 Characters)

On this fact, it give different Windows API Function's

They are splitted into A and W Function's like MessageBoxA or MessageBoxW.

A stands for Ansi (255 Letters for each Byte)
W stands for WideChar/UTF-8 and UTF-16
« Last Edit: July 02, 2025, 01:01:15 pm by paule32 »
MS-IIS - Internet Information Server, Apache, PHP/HTML/CSS, MinGW-32/64 MSys2 GNU C/C++ 13 (-stdc++20), FPC 3.2.2
A Friend in need, is a Friend indeed.

cdbc

  • Hero Member
  • *****
  • Posts: 2816
    • http://www.cdbc.dk
Re: 4-bytes characters bug!
« Reply #6 on: July 02, 2025, 01:45:16 pm »
Hi
@paule32: You really shouldn't talk about things that you do not understand:
Quote
Windows use 16 Bit Encoding UTF-16.
this means, you have 4 Bytes for each Character/Letter in a String.
UTF-8 will be supported - that consume only 8 Bit => 2 Bytes for each Letter in a String.
ANSI will supported, too (also 4 Bit => 1 Bytes => 255 Characters)

On this fact, it give different Windows API Function's

They are splitted into A and W Function's like MessageBoxA or MessageBoxW.

A stands for Ansi (255 Letters for each Byte)
W stands for WideChar/UTF-8 and UTF-16
I really hope that no 'noobs' read the above quote... PLEASE GO STUDY THE SUBJECT OF STRINGS, paule!!!
Since when is 4 bits = 1 byte?!?
And the difference between 'Ansi' and 'Ascii'?!?
My condolences paule - Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE6/QT6 -> FPC Release -> Lazarus Release &  FPC Main -> Lazarus Main

paule32

  • Hero Member
  • *****
  • Posts: 647
  • One in all. But, not all in one.
Re: 4-bytes characters bug!
« Reply #7 on: July 02, 2025, 02:16:06 pm »
yes, shame on my head...  :o

I was squirel a little bit.

The original ANSI letters run from 0 to 127.
The extended ANSI letters end at 255.

So you have two groups of 4 bits (4 for the original and 4 for the extended character set).
0 0 0 0 - 1 1 1 1 = 8 bits = 1 byte
16 * 16 = 256 values, so the highest is 256 - 1 = 255
« Last Edit: July 02, 2025, 02:20:35 pm by paule32 »

Thaddy

  • Hero Member
  • *****
  • Posts: 19268
  • Glad to be alive.
Re: 4-bytes characters bug!
« Reply #8 on: July 02, 2025, 02:23:30 pm »
ASCII is 7 bit, not 8. But it is expanded to 8 in normal code.
objects are fine constructs. You can even initialize them with constructors.

paule32

  • Hero Member
  • *****
  • Posts: 647
  • One in all. But, not all in one.
Re: 4-bytes characters bug!
« Reply #9 on: July 02, 2025, 02:29:38 pm »
Yes, 7 bits.
But the CPU works in 8-bit bytes, so ASCII leaves 1 bit unused, and nobody noticed it.

The 7 bits are an artefact from the time of the first e-mail between two universities.
People thought that 7 bits were enough for teletype machines...

But some months later, the global players needed more comfort...
And it will never stop...

PascalDragon

  • Hero Member
  • *****
  • Posts: 6397
  • Compiler Developer
Re: 4-bytes characters bug!
« Reply #10 on: July 03, 2025, 09:21:14 pm »
Quote
The problem is you are not taking into account that the UI controls on Windows use a different character encoding than your strings are using. Your strings are using UTF-8, but on Windows your UI controls do not use UTF-8, they use ANSI or UTF-16, depending on how you create them. You need to take the UI encoding into account when translating characters/indexes between your UI and your strings. It is not a 1:1 mapping.

Well, with current Windows versions (starting from some version of Windows 10) you can enable UTF-8 as an encoding for the A API. Lazarus provides a switch for this in the manifest settings.

Thaddy

  • Hero Member
  • *****
  • Posts: 19268
  • Glad to be alive.
Re: 4-bytes characters bug!
« Reply #11 on: July 04, 2025, 06:18:44 pm »
Richmemo is not part of the LCL.

 
