Author Topic: more UTF8 confusing  (Read 3452 times)

lazer

  • Full Member
  • ***
  • Posts: 215
more UTF8 confusing
« on: February 03, 2023, 09:02:50 pm »
Hi, I'm having UTF8 troubles again. :(

I save my text files, which include accented characters, with featherpad on linux choosing utf8 encoding.
I read them into structures based on the following types.

Code: Pascal
type

  TWordArray = array[0..2] of string[phraselen];

  TCard = record
    cardwords: TWordArray;
    phrase: string[phraselen];
  end;

When I display them in a TPanel.Caption, it works fine.
Code: Pascal
textPanels[i].panelLabel.Caption := cardstr;

However, I need to copy one char at a time into a grid of TstaticText controls.
Code: Pascal
grid[posx][posy+i].Caption := sisword[i];

This works fine for a..z A..Z but obviously gets confused by multibyte chars.

How can I tell how long each letter is, so that I can copy each one correctly to a separate TStaticText?

TIA.

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2006
  • Fifty shades of code.
    • Delphi & FreePascal
Re: more UTF8 confusing
« Reply #1 on: February 03, 2023, 09:14:05 pm »
UTF8Copy may help you.

paweld

  • Hero Member
  • *****
  • Posts: 966
Re: more UTF8 confusing
« Reply #2 on: February 03, 2023, 09:14:37 pm »
Code: Pascal
uses LazUTF8;
  //...
  grid[posx][posy+i].Caption := UTF8Copy(sisword, i, 1);
Best regards / Pozdrawiam
paweld

KodeZwerg

  • Hero Member
  • *****
  • Posts: 2006
  • Fifty shades of code.
    • Delphi & FreePascal
Re: more UTF8 confusing
« Reply #3 on: February 03, 2023, 09:16:40 pm »
In addition, you can use UTF8Length to keep the index within the valid range.
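As a minimal sketch of the two suggestions above (UTF8Copy plus UTF8Length), assuming the LazUtils package is on the unit search path. The program name and the sample word are made up for illustration, and the word is spelled with byte escapes so the result does not depend on how the source file is saved. In the original code the Caption assignment would take the place of the writeln. Note that UTF8Copy counts from 1 and that the loop bound is UTF8Length, not Length, which counts bytes:

```pascal
program splitcodepoints;

{$mode objfpc}{$H+}

uses
  LazUTF8;  // part of LazUtils; assumed to be reachable by the compiler

const
  // Hypothetical stand-in for a word read from the file:
  // "fa", U+00E7 (c with cedilla, bytes C3 A7), "ade".
  sisword: string = 'fa'#$C3#$A7'ade';

var
  i: PtrInt;

begin
  // UTF8Length counts code points, so the index stays inside the word.
  for i := 1 to UTF8Length(sisword) do
    // UTF8Copy takes a 1-based code point index and a code point count.
    writeln(i, ': ', UTF8Copy(sisword, i, 1));
end.
```

Indexing from 0, or looping up to Length(sisword), would address positions that do not correspond to code points, which may be one reason a plain swap to UTF8Copy shows no improvement. This approach splits by code point only, so a letter stored as a base letter plus a combining mark would still end up in two cells.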

lazer

  • Full Member
  • ***
  • Posts: 215
Re: more UTF8 confusing
« Reply #4 on: February 03, 2023, 10:11:01 pm »
Code: Pascal
grid[posx][posy+i].Caption := UTF8Copy(sisword, i, 1);

Not seeing any better results.


KodeZwerg

  • Hero Member
  • *****
  • Posts: 2006
  • Fifty shades of code.
    • Delphi & FreePascal
Re: more UTF8 confusing
« Reply #5 on: February 03, 2023, 10:21:44 pm »
Can you attach a demo project that shows your problem?

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: more UTF8 confusing
« Reply #6 on: February 03, 2023, 11:38:43 pm »
Code: Pascal
program unicode;

{$mode objfpc}
{$h+}
{$codepage utf8}

uses
  types;  // TStringDynArray

// Splits str_in into displayed characters, one array element each.
function utf8chars(const str_in: string; withCombiningDiacriticals: boolean = true): TStringDynArray;
  // i is the current byte index into str_in, n the number of elements stored so far.
  procedure primary(const len: integer; i: integer=1; n: integer = 0);
    // Copies the next n_bytes bytes into result[n], then advances i and n.
    procedure secondary(const n_bytes: integer);
      begin
        result[n] := copy(str_in, i, n_bytes);
        inc(i, n_bytes);
        inc(n);
      end;
    begin
      setlength(result, len);  // upper bound: one element per byte
      while i <= len do secondary(Utf8CodePointLen(@str_in[i], maxInt, withCombiningDiacriticals));
      setlength(result, n);    // trim to the number of elements actually stored
    end;
  begin
    result := default(TStringDynArray);
    primary(length(str_in));
  end;

const
  boo: string = 'ábcdéfghíǝ́Á̊ÅÁǺÁwow!';

var
  str: string;

begin
  writeln(boo);
  for str in utf8chars(boo) do writeln(str);
end.

Utf8CodePointLen is also useful here.

utf8chars above returns one array element per displayed character for most strings (so the length of the resulting array is the character count), provided the diacritical marks are combined correctly.
Each element is not strictly a single Unicode code point: it is a string that is supposed to contain one displayed character, which can be multi-byte and, with combining marks, can span several code points.
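For readers who find the nested procedures hard to follow, here is a flat sketch of the same idea. It is not a drop-in replacement for the code above: the names SplitUtf8Chars and Sample are made up, the look-ahead is limited to the bytes that remain instead of maxInt, and the guard against a non-positive return value is an assumption about how Utf8CodePointLen reports incomplete or invalid input.

```pascal
program splitchars;

{$mode objfpc}{$H+}

uses
  Types;  // TStringDynArray

// Splits a UTF-8 string into displayed characters: each element holds one
// code point plus any combining diacritical marks that follow it.
function SplitUtf8Chars(const s: string): TStringDynArray;
var
  i, n, len, size: SizeInt;
begin
  Result := nil;
  len := Length(s);
  SetLength(Result, len);  // upper bound: one element per byte
  i := 1;
  n := 0;
  while i <= len do
  begin
    // Look ahead only as far as the bytes that remain in the string.
    size := Utf8CodePointLen(@s[i], len - i + 1, True);
    // Assumption: a non-positive result means an incomplete or invalid
    // sequence. Take a single byte so that the loop always advances.
    if size <= 0 then
      size := 1;
    Result[n] := Copy(s, i, size);
    Inc(i, size);
    Inc(n);
  end;
  SetLength(Result, n);  // trim to the number of elements actually stored
end;

const
  // 'c' + U+0327 (combining cedilla, bytes CC A7), then 'a'
  Sample: string = 'c'#$CC#$A7'a';

var
  part: string;

begin
  for part in SplitUtf8Chars(Sample) do
    writeln(part, ' (', Length(part), ' bytes)');
end.
```

Each element can then be assigned to one control's Caption. Without the guard, a return value of zero would leave the byte index unchanged and the loop would never finish.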
« Last Edit: February 05, 2023, 12:27:06 am by Bogen85 »

paweld

  • Hero Member
  • *****
  • Posts: 966
Best regards / Pozdrawiam
paweld

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: more UTF8 confusing
« Reply #8 on: February 04, 2023, 08:01:26 am »
Quote from: paweld
@Bogen85: https://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

It is confusing to me that Lazarus and Free Pascal both have units that provide similar functionality.

I know OP is expressly using Lazarus, but many Free Pascal programs not using Lazarus units need to do similar things with UTF8.

So duplicate functionality exists, but with different function names and parameters...

So I find this confusing concerning FreePascal and UTF8, but not for the same reasons as OP most likely.
However, this is posted in Free Pascal General, and not in a Lazarus specific area...

lazer

  • Full Member
  • ***
  • Posts: 215
Re: more UTF8 confusing
« Reply #9 on: February 04, 2023, 09:06:24 am »
Wow, I had no idea what a vipers' nest I was walking into just wanting a little twiddly bit on the bottom of the letter c !!!

Many thanks to Bogen85 for that full and explicit code sample.  I would never have got to that. I'm not even sure I understand the syntax of that procedure in procedure in function thing.  I never knew that was possible !

It is very unfortunate that this was not done in a coordinated way between fpc and Lazarus.

Anyway, it seems to be doing what I need now, so huge thanks for that code. It's insane that it's that complicated but at least I have a solution and have learnt a few new tricks with fpc.

 8-)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4458
  • I like bugs.
Re: more UTF8 confusing
« Reply #10 on: February 04, 2023, 11:28:56 am »
Quote from: Bogen85
It is confusing to me that Lazarus and Free Pascal both have units that provide similar functionality.
I know OP is expressly using Lazarus, but many Free Pascal programs not using Lazarus units need to do similar things with UTF8.
So duplicate functionality exists, but with different function names and parameters...
Yes, the reason is that the Lazarus UTF-8 solution was made before FPC had such library functions.
Ideally Lazarus should now start to use the FPC library funcs, but they are done in a very different way.
I looked at:
 function Utf8CodePointLen(P: PAnsiChar; MaxLookAhead: SizeInt; IncludeCombiningDiacriticalMarks: Boolean): SizeInt;
It is similar to the function UTF8CodepointSize in unit LazUTF8 in package LazUtils. However, it has a parameter MaxLookAhead which is used only for checking validity. Why should a user provide such a value? At least it should have a default value.
The parameter IncludeCombiningDiacriticalMarks is wrong in a function called Utf8CodePointLen. A combining diacritical mark is another codepoint, yet the function name suggests that the length of only one codepoint is returned.
There should be another function called Utf8CharacterLen or similar which also includes the combining diacritical marks. Yes, the term "character" can have many meanings in Unicode, but this may be the best meaning (a codepoint + its combining diacritical marks).
The LazUtils function UTF8CodepointSize has one parameter and is well optimized. I would not switch it to FPC's function now.
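To make the codepoint versus "character" distinction concrete, here is a small sketch, assuming LazUtils is on the unit search path. The program and constant names are made up for illustration. The same visible letter is stored once as a single precomposed codepoint and once as a base letter plus a combining mark, and Length (bytes) is compared with UTF8Length (codepoints):

```pascal
program countcodepoints;

{$mode objfpc}{$H+}

uses
  LazUTF8;  // part of LazUtils; assumed to be reachable by the compiler

const
  // U+00E7, c with cedilla as one precomposed codepoint (bytes C3 A7)
  Precomposed: string = #$C3#$A7;
  // 'c' followed by U+0327, the combining cedilla (bytes CC A7)
  Decomposed: string = 'c'#$CC#$A7;

begin
  // Length counts bytes, UTF8Length counts codepoints.
  writeln('precomposed: ', Length(Precomposed), ' bytes, ',
          UTF8Length(Precomposed), ' codepoint(s)');
  writeln('decomposed:  ', Length(Decomposed), ' bytes, ',
          UTF8Length(Decomposed), ' codepoint(s)');
end.
```

The precomposed form is 2 bytes and 1 codepoint, the decomposed form 3 bytes and 2 codepoints, although both display as one letter. Code that splits by codepoint therefore separates the second form into two pieces.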

Quote from: lazer
It is very unfortunate that this was not done in a coordinated way between fpc and Lazarus.
Anyway, it seems to be doing what I need now, so huge thanks for that code. It's insane that it's that complicated but at least I have a solution and have learnt a few new tricks with fpc.
True.
The complication comes from the Unicode standard itself. It is super complicated.

BTW, the LazUtils package can be used also in console programs. Recommended.
« Last Edit: February 04, 2023, 11:30:51 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: more UTF8 confusing
« Reply #11 on: February 04, 2023, 03:09:29 pm »
Quote from: lazer
Wow, I had no idea what a vipers' nest I was walking into just wanting a little twiddly bit on the bottom of the letter c !!!

The "little twiddly bit on the bottom of the letter c" is likely stored as several Unicode code points (a base letter plus a combining diacritical mark) rather than as a single multi-byte code point. (At least that would be my guess.)

Quote from: lazer
Many thanks to Bogen85 for that full and explicit code sample.  I would never have got to that. I'm not even sure I understand the syntax of that procedure in procedure in function thing.  I never knew that was possible !

Well, I grabbed what I had that I'd used trying to figure something else out... (I originally did not intend to share the code in that double-nested form; for me it has to do with the lack of const variables in Free Pascal plus the disconnect between declaration and assignment, but I digress...)

Glad I could help!
« Last Edit: February 04, 2023, 03:28:53 pm by Bogen85 »

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: more UTF8 confusing
« Reply #12 on: February 04, 2023, 03:22:07 pm »
Quote from: JuhaManninen
Ideally Lazarus should now start to use the FPC library funcs, but they are done in a very different way.
I looked at:
 function Utf8CodePointLen(P: PAnsiChar; MaxLookAhead: SizeInt; IncludeCombiningDiacriticalMarks: Boolean): SizeInt;
It is similar to the function UTF8CodepointSize in unit LazUTF8 in package LazUtils. However, it has a parameter MaxLookAhead which is used only for checking validity. Why should a user provide such a value? At least it should have a default value.
The parameter IncludeCombiningDiacriticalMarks is wrong in a function called Utf8CodePointLen. A combining diacritical mark is another codepoint, yet the function name suggests that the length of only one codepoint is returned.
There should be another function called Utf8CharacterLen or similar which also includes the combining diacritical marks. Yes, the term "character" can have many meanings in Unicode, but this may be the best meaning (a codepoint + its combining diacritical marks).
The LazUtils function UTF8CodepointSize has one parameter and is well optimized. I would not switch it to FPC's function now.

Yes, the MaxLookAhead is odd. It needs to be set high enough to grab all the diacritical marks that are combined with the initial codepoint, so that all codepoints making up the "character" (which is a confusing, slightly ambiguous term in Unicode...) are included.

If it is set too low, it won't grab enough, but I've not found that it can be set too high, as it never grabs subsequent codepoints that are not part of the first combined set.

Quote from: JuhaManninen
BTW, the LazUtils package can be used also in console programs. Recommended.

I will take a look.

I tend to not use LCL units as I get a lot of warnings/hints from using them (which I always have set to be errors in my compile flags) so to not have to fiddle with using LCL units in a "special" manner I just avoid them and stick with FCL (as I don't get those kinds of warnings/hints from them) and my own units.

« Last Edit: February 05, 2023, 12:17:25 am by Bogen85 »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4458
  • I like bugs.
Re: more UTF8 confusing
« Reply #13 on: February 04, 2023, 06:16:49 pm »
Quote from: Bogen85
I tend to not use LCL units as I get a lot of warnings/hints from using them (which I always have set to be errors in my compile flags) so to not have to fiddle with using LCL units in a "special" manner I just avoid them and stick with FCL (as I don't get those kinds of warnings/hints from them) and my own units.
LazUtils does not depend on LCL.
LCL depends on LazUtils.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bogen85

  • Hero Member
  • *****
  • Posts: 595
Re: more UTF8 confusing
« Reply #14 on: February 04, 2023, 10:34:15 pm »
Alright, I tried UTF8CodepointSize from lazutf8,

(I had to create a wrapper unit, which I don't need to do with any unit I use from the FPC install, but that is another issue. Maybe I'm doing something wrong in how I'm specifying where to get the Lazarus units from, which are in sub-directories of the directory that Lazarus is installed in.)

Code: Pascal
// lazutf8.pp
{$push}
{$warnings off}
{$hints off}
{$notes off}
{$include lazutf8.pas}
{$pop}

It does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.

Utf8CodePointLen which I'm using does return the correct number of bytes. (The ASCII character, 1 byte, plus the bytes for each diacritical marker).

UTF8CodepointSize does return the correct number of bytes for multi-byte Unicode code points, but since it does not check for diacritical marks it won't tell you how many bytes make up what ends up being a single displayed character (which can be more than 4 bytes).
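A small comparison sketch of the behaviour described above, assuming LazUtils is on the unit search path. The program and constant names are made up, and no output is claimed here beyond what is reported in this post. The sample is an ASCII letter followed by one combining mark:

```pascal
program comparesizes;

{$mode objfpc}{$H+}

uses
  LazUTF8;  // part of LazUtils; assumed to be reachable by the compiler

const
  // 'a' followed by U+0301 (combining acute accent, bytes CC 81)
  Sample: string = 'a'#$CC#$81;

begin
  // LazUTF8: byte size of the first code point only.
  writeln('UTF8CodepointSize:        ', UTF8CodepointSize(PChar(Sample)));
  // FPC RTL with combining marks excluded.
  writeln('Utf8CodePointLen (False): ',
    Utf8CodePointLen(PChar(Sample), Length(Sample), False));
  // FPC RTL with combining marks included; the look-ahead is limited
  // to the bytes that are actually in the string.
  writeln('Utf8CodePointLen (True):  ',
    Utf8CodePointLen(PChar(Sample), Length(Sample), True));
end.
```

Going by the results above, the first call should report 1 and the last one 3 (the letter plus the two bytes of the mark).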

I don't see anything in lazutf8 that works.

I tried all of these:

Code: Pascal
UTF8CodepointSize(pChar(str)),  // had to disable note for declared inline but not inlined
UTF8CharacterLength(pChar(str)), // deprecated, but I tried it anyways
UTF8CodepointStrictSize(pChar(str)),
UTF8CharacterStrictLength(pChar(str)), // deprecated, but I tried it anyways
UTF8Length(pChar(str)),
UTF8LengthFast(pChar(str)),

They all come up short as far as number of bytes in the displayed character when trailing diacritical markers are present.

So what from lazutils (or lazutf8) can be used to get the number of bytes in the string that are for the displayed character?

If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.
« Last Edit: February 04, 2023, 11:39:51 pm by Bogen85 »

 
