
Free Pascal => General => Topic started by: lazer on February 03, 2023, 09:02:50 pm

Title: more UTF8 confusing
Post by: lazer on February 03, 2023, 09:02:50 pm

Hi, I'm having UTF8 troubles again. :(

I save my text files, which include accented characters, with featherpad on linux choosing utf8 encoding.
I read them into structures based on the following types.

Code: Pascal

type
 
TwordArray = array[0..2] of string[phraselen];
 
Tcard=record
  cardwords:TWordArray;
  phrase:string[phraselen] ;
end;
 

When I display them in a Tpanel.caption , it works fine.

Code: Pascal

        textPanels[i].panelLabel.caption:= cardstr;
 

However, I need to copy one char at a time into a grid of TstaticText controls.

Code: Pascal

    grid[posx][posy+i].Caption:=sisword[i];  

This works fine for a..z A..Z but obviously gets confused by multibyte chars.

How can I tell how long each letter is, so that I can copy each one correctly to a separate TStaticText?

TIA.

Title: Re: more UTF8 confusing
Post by: KodeZwerg on February 03, 2023, 09:14:05 pm

UTF8Copy (https://lazarus-ccr.sourceforge.io/docs/lazutils/lazutf8/utf8copy.html) may help you.

Title: Re: more UTF8 confusing
Post by: paweld on February 03, 2023, 09:14:37 pm

Code: Pascal

uses  LazUTF8;
  //...    
  grid[posx][posy+i].Caption := UTF8Copy(sisword, i, 1); 

Title: Re: more UTF8 confusing
Post by: KodeZwerg on February 03, 2023, 09:16:40 pm

In addition, you may use UTF8Length (https://lazarus-ccr.sourceforge.io/docs/lazutils/lazutf8/utf8length.html) to stay within the valid range.
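
A minimal sketch of how the two suggestions fit together, assuming the LazUtils package is on the unit search path. The word 'façade' and the console output are only illustrative stand-ins for the grid of TStaticText controls. UTF8Copy takes a 1-based code point index, so the loop runs from 1 to UTF8Length.

```pascal
program utf8grid;

{$mode objfpc}{$H+}

uses
  LazUTF8; // part of LazUtils; assumed to be on the unit search path

var
  sisword: string;
  i: PtrInt;
begin
  // 'façade', with ç written as its two UTF-8 bytes $C3 $A7 (U+00E7)
  sisword := 'fa'#$C3#$A7'ade';
  // UTF8Copy counts code points starting at 1, so loop 1..UTF8Length
  for i := 1 to UTF8Length(sisword) do
    WriteLn(i, ': ', UTF8Copy(sisword, i, 1));
end.
```

In a form, the WriteLn line would become an assignment to the Caption of the matching control. Note that this splits by code point only, so a letter followed by a separate combining mark would still end up in two cells.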

Title: Re: more UTF8 confusing
Post by: lazer on February 03, 2023, 10:11:01 pm

Code: Pascal

    grid[posx][posy+i].Caption:=UTF8copy(sisword,i,1);    

Not seeing any better results.

Title: Re: more UTF8 confusing
Post by: KodeZwerg on February 03, 2023, 10:21:44 pm

Can you attach a demo project that show your problem?

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 03, 2023, 11:38:43 pm

Code: Pascal

program unicode;
 
{$mode objfpc}
{$h+}
{$codepage utf8}
 
uses
  types;
 
function utf8chars(const str_in: string; withCombiningDiacriticals: boolean = true): TStringDynArray;
  procedure primary(const len: integer; i: integer=1; n: integer = 0);
    procedure secondary(const n_bytes: integer);
      begin
        result[n] := copy(str_in, i, n_bytes);
        inc(i, n_bytes);
        inc(n);
      end;
    begin
      setlength(result, len);
      while i <= len do secondary(Utf8CodePointLen(@str_in[i], maxInt, withCombiningDiacriticals));
      setlength(result, n);
    end;
  begin
    result := default(TStringDynArray);
    primary(length(str_in));
  end;
 
const
  boo: string = 'ábcdéfghíǝ́Á̊ÅÁǺÁwow!';
 
var
  str: string;
 
begin
  writeln(boo);
  for str in utf8chars(boo) do writeln(str);
end.

Utf8CodePointLen is also useful here.

The utf8chars function above gives the proper length for most strings (the length of the resulting array), provided the diacritical markers are combined correctly.
Each element in the array will be a Unicode code point... (well, not exactly...) each element is a string that is supposed to contain one Unicode character (which can be multi-byte).

Title: Re: more UTF8 confusing
Post by: paweld on February 04, 2023, 07:49:35 am

@Bogen85: https://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 04, 2023, 08:01:26 am

Quote from: paweld on February 04, 2023, 07:49:35 am

@Bogen85: https://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code

It is confusing to me that Lazarus and Free Pascal both have units that provide similar functionality.

I know OP is expressly using Lazarus, but many Free Pascal programs not using Lazarus units need to do similar things with UTF8.

So duplicate functionality exists, but with different function names and parameters...

So I find this confusing concerning FreePascal and UTF8, but not for the same reasons as OP most likely.
However, this is posted in Free Pascal General, and not in a Lazarus specific area...

Title: Re: more UTF8 confusing
Post by: lazer on February 04, 2023, 09:06:24 am

Wow, I had no idea what a vipers' nest I was walking into, just wanting a little twiddly bit on the bottom of the letter c !!!

Many thanks to Bogen85 for that full and explicit code sample. I would never have got to that. I'm not even sure I understand the syntax of that procedure in procedure in function thing. I never knew that was possible !

It is very unfortunate that this was not done in a coordinated way between fpc and Lazarus.

Anyway, it seems to be doing what I need now, so huge thanks for that code. It's insane that it's that complicated but at least I have a solution and have learnt a few new tricks with fpc.

8-)

Title: Re: more UTF8 confusing
Post by: JuhaManninen on February 04, 2023, 11:28:56 am

Quote from: Bogen85 on February 04, 2023, 08:01:26 am

It is confusing to me that Lazarus and Free Pascal both have units that provide similar functionality.
I know OP is expressly using Lazarus, but many Free Pascal programs not using Lazarus units need to do similar things with UTF8.
So duplicate functionality exists, but with different function names and parameters...

Yes, the reason is that Lazarus UTF-8 solution was made before FPC had such library functions.
Ideally Lazarus should now start to use the FPC library funcs, but they are done in a very different way.
I looked at :
function Utf8CodePointLen(P: PAnsiChar; MaxLookAhead: SizeInt; IncludeCombiningDiacriticalMarks: Boolean): SizeInt;
It is similar to function UTF8CodepointSize in unit LazUTF8 in package LazUtils. However, it has a parameter MaxLookAhead which is used only for checking validity. Why should a user provide such a value? At least it should have a default value.
The parameter IncludeCombiningDiacriticalMarks is wrong in a function called Utf8CodePointLen. A combining diacritical mark is another CodePoint, yet the function name suggests that the length of only one codepoint is returned.
There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).
The LazUtils function UTF8CodepointSize has one parameter and is well optimized. I would not switch it to the FPC's function now.

Quote from: lazer

It is very unfortunate that this was not done in a coordinated way between fpc and Lazarus.
Anyway, it seems to be doing what I need now, so huge thanks for that code. It's insane that it's that complicated but at least I have a solution and have learnt a few new tricks with fpc.

True.
The complication comes from Unicode standard itself. It is super complicated.

BTW, the LazUtils package can be used also in console programs. Recommended.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 04, 2023, 03:09:29 pm

Quote from: lazer on February 04, 2023, 09:06:24 am

Wow, I had no idea what a vipers' nest I was walking into, just wanting a little twiddly bit on the bottom of the letter c !!!

"little twiddly bit on the bottom of the letter c" likely has to do with that being multiple multi-byte Unicode code points (as it is part of a combined set that is using diacritical markers), and not just a single multi-byte Unicode code point (at least that would be my guess).

Quote from: lazer on February 04, 2023, 09:06:24 am

Many thanks to Bogen85 for that full and explicit code sample. I would never have got to that. I'm not even sure I understand the syntax of that procedure in procedure in function thing. I never knew that was possible !

Well, I grabbed what I had, which I'd used while trying to figure something else out... (I originally did not intend to share that code in that form (double nested...); for me it has to do with the lack of const variables in Free Pascal plus the disconnect between declaration and assignment..., but I digress...)

Glad I could help!

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 04, 2023, 03:22:07 pm

Quote from: JuhaManninen on February 04, 2023, 11:28:56 am

Ideally Lazarus should now start to use the FPC library funcs, but they are done in a very different way.
I looked at :
function Utf8CodePointLen(P: PAnsiChar; MaxLookAhead: SizeInt; IncludeCombiningDiacriticalMarks: Boolean): SizeInt;
It is similar to function UTF8CodepointSize in unit LazUTF8 in package LazUtils. However, it has a parameter MaxLookAhead which is used only for checking validity. Why should a user provide such a value? At least it should have a default value.
The parameter IncludeCombiningDiacriticalMarks is wrong in a function called Utf8CodePointLen. A combining diacritical mark is another CodePoint, yet the function name suggests that the length of only one codepoint is returned.
There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).
The LazUtils function UTF8CodepointSize has one parameter and is well optimized. I would not switch it to the FPC's function now.

Yes, the MaxLookAhead is odd. It needs to be set high enough to grab all the diacritical markers that are combined with the initial codepoint, so that all codepoints making up the "character" (which is a confusing, slightly ambiguous term in Unicode...) are included.

If it is set too low, it won't grab enough, but I've not found that it can be set too high, as it never grabs subsequent code points that are not part of the first combined set.

Quote from: JuhaManninen on February 04, 2023, 11:28:56 am

BTW, the LazUtils package can be used also in console programs. Recommended.

I will take a look.

I tend to not use LCL units as I get a lot of warnings/hints from using them (which I always have set to be errors in my compile flags) so to not have to fiddle with using LCL units in a "special" manner I just avoid them and stick with FCL (as I don't get those kinds of warnings/hints from them) and my own units.

Title: Re: more UTF8 confusing
Post by: JuhaManninen on February 04, 2023, 06:16:49 pm

Quote from: Bogen85 on February 04, 2023, 03:22:07 pm

I tend to not use LCL units as I get a lot of warnings/hints from using them (which I always have set to be errors in my compile flags) so to not have to fiddle with using LCL units in a "special" manner I just avoid them and stick with FCL (as I don't get those kinds of warnings/hints from them) and my own units.

LazUtils does not depend on LCL.
LCL depends on LazUtils.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 04, 2023, 10:34:15 pm

Alright, I tried UTF8CodepointSize from lazutf8,

(had to create a wrapper unit, which I don't need to do with any unit I use from the FPC install, but that is another issue..., maybe I'm doing something wrong in how I'm specifying where to get the Lazarus units from, which are in sub-directories of the directory that Lazarus is installed in)

Code: Pascal

// lazutf8.pp
{$push}
{$warnings off}
{$hints off}
{$notes off}
{$include lazutf8.pas}
{$pop}

It does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.

Utf8CodePointLen which I'm using does return the correct number of bytes. (The ASCII character, 1 byte, plus the bytes for each diacritical marker).

UTF8CodepointSize does return the correct number of bytes for multi-byte Unicode code points, but since it does not check for diacritical markers it won't tell you how many bytes make up what ends up being a single displayed character (which can be more than 4 bytes).
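
To make the difference concrete, here is a small sketch that uses only the FPC RTL function Utf8CodePointLen, with the signature quoted earlier in this thread. The test string is an assumption chosen for illustration: a plain 'e' followed by U+0301 (combining acute accent). Going by the behaviour described above, the first call should report 1 byte and the second 3 bytes.

```pascal
program cplen;

{$mode objfpc}{$H+}

const
  // 'e' followed by U+0301 COMBINING ACUTE ACCENT (UTF-8 bytes $CC $81)
  s: string = 'e'#$CC#$81;

begin
  // Base code point only: the trailing mark is not counted
  WriteLn(Utf8CodePointLen(PAnsiChar(s), Length(s), False));
  // Base code point plus the combining marks that follow it
  WriteLn(Utf8CodePointLen(PAnsiChar(s), Length(s), True));
end.
```

The one-parameter functions in lazutf8 correspond to the first call, which is why they report 1 for this input.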

I don't see anything in lazutf8 that works.

I tried all of these:

Code: Pascal

UTF8CodepointSize(pChar(str)),  // had to disable note for declared inline but not inlined
UTF8CharacterLength(pChar(str)), // deprecated, but I tried it anyways
UTF8CodepointStrictSize(pChar(str)),
UTF8CharacterStrictLength(pChar(str)), // deprecated, but I tried it anyways
UTF8Length(pChar(str)),
UTF8LengthFast(pChar(str)),

They all come up short as far as number of bytes in the displayed character when trailing diacritical markers are present.

So what from lazutils (or lazutf8) can be used to get the number of bytes in the string that are for the displayed character?

If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.

Title: Re: more UTF8 confusing
Post by: circular on February 04, 2023, 11:30:51 pm

In fact, parsing letters can be complicated. Glyphs are put together at different levels:
- byte (for ASCII that's sufficient)
- unicode char (some code points already include accents)
- "multichars": letter with its non spacing marks, which are zero-length unicode characters (the list of unicode values for non-spacing mark is not trivial and can change with new versions of Unicode)
- merge some letters together for example right-to-left Arabic letters ل and ا become not ل‍ا but لا.

So, in the end, a single glyph can be the following unicode "chars": letter + mark + letter + mark.
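
The first three levels can be seen with a short sketch. It compares two ways of writing the same visible "é": precomposed as U+00E9 and decomposed as 'e' plus U+0301. CountCodePoints is a hypothetical helper written for this example, not an RTL or LazUtils function, and it assumes valid UTF-8 input.

```pascal
program layers;

{$mode objfpc}{$H+}

// Count code points by skipping UTF-8 continuation bytes (10xxxxxx).
// Hypothetical helper for illustration; assumes s is valid UTF-8.
function CountCodePoints(const s: string): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

const
  precomposed: string = #$C3#$A9;     // U+00E9: the accent is part of the code point
  decomposed:  string = 'e'#$CC#$81;  // 'e' + U+0301: letter plus non-spacing mark

begin
  WriteLn('precomposed: ', Length(precomposed), ' bytes, ',
          CountCodePoints(precomposed), ' code point(s)');
  WriteLn('decomposed:  ', Length(decomposed), ' bytes, ',
          CountCodePoints(decomposed), ' code point(s)');
end.
```

Both strings render as one glyph, yet they are 2 bytes / 1 code point and 3 bytes / 2 code points respectively, so neither the byte count nor the code point count tells you the number of glyphs.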

I've made an implementation of that in the TGlyphCursorUtf8 class of BGRAUTF8 unit. There are still things it doesn't handle, for example I know there are some Indian letters that are supposed to merge together.

More explanations there:
https://forum.lazarus.freepascal.org/index.php/topic,49750.msg361630.html#msg361630

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 12:44:39 am

Utf8CodePointLen and its max look ahead are very poorly named.

The max look ahead is not a max... It is more of a max allowed for diacritical markers that need to be combined with the base code point.

And it is not a single code point size...

It is the size of the base code point in bytes plus all the sizes of the diacritical marker points to be combined into the final displayed character.

To be honest, I stumbled across the solution for OP lazer by accident; that was not the problem I was originally trying to solve for myself. I just wanted to use a function that was not part of Lazarus, saw the IncludeCombiningDiacriticalMarks parameter, looked those marks up as I'd never heard of them (I mean in the Unicode sense; of course I'd seen them on character glyphs), and fiddled with it until I got them working by setting MaxLookAhead high enough....

I was just looking for a function to return the number of bytes in the code point, and nothing more...

Quote from: JuhaManninen on February 04, 2023, 11:28:56 am

There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).

Agreed....

But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...

I'm not sure what rules pertain to them. From the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3, triple-wide; and so forth...
(But those are just the ones I use...)

My programs are not directly GUI related (currently) so I'm relying on text editor (cuda text), terminal (WezTerm), and various web browsers to render the displayed unicode characters correctly...
Web browsers seem to get things right more often than my text editor and terminal...

Title: Re: more UTF8 confusing
Post by: JuhaManninen on February 05, 2023, 12:59:02 am

Quote from: Bogen85 on February 04, 2023, 10:34:15 pm

Alright, I tried UTF8CodepointSize from lazutf8,
...
It does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.

It returns the number of bytes in one codepoint as the function name suggests. The following diacritical markers are also codepoints. IMO a function counting them all should be named differently.

Quote

If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.

Indeed LazUtf8 unit does not have such function now. Using Utf8CodePointLen is a good idea.
I recommended LazUtils package in general. It has many things, not only Unicode stuff.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 01:11:55 am

Quote from: JuhaManninen on February 05, 2023, 12:59:02 am

Indeed LazUtf8 unit does not have such function now. Using Utf8CodePointLen is a good idea.

Yeah using it with the understanding that it is not named correctly and the max look ahead parameter also has a misleading name. I should look at the code for it...

Quote from: JuhaManninen on February 05, 2023, 12:59:02 am

I recommended LazUtils package in general. It has many things, not only Unicode stuff.

Yeah, there were several things here I was not using that I should be:
https://wiki.freepascal.org/LazUtils_Documentation_Roadmap

Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 07:16:43 am

Utf8CodePointLen was updated recently and the logic is straightforward.

https://gitlab.com/freepascal.org/fpc/source/-/blob/b38d13577f94364b4c7ba6f4d6b032eae404e934/rtl/inc/generic.inc#L1147

There is no penalty for MaxLookAhead being set too high.

After the first code point it collects additional code points, but only if they are diacritical markers. As soon as it hits a code point that is not a diacritical marker, it stops, and only the preceding diacritical marker byte counts are added to the total.

After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.

It is only the name Utf8CodePointLen that is misleading, and only when IncludeCombiningDiacriticalMarks is enabled.

When IncludeCombiningDiacriticalMarks is enabled, the name of the function is effectively something like CombinedUtf8CodePointLengthsOfBaseCodePointAndDiacriticalMarks.
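
The logic described in this post can be sketched as follows. This is a simplified illustration, not the RTL code: SeqLen, IsBasicCombiningMark and ClusterLen are hypothetical names, only the U+0300..U+036F block is recognised, and the input is assumed to be valid UTF-8 with 1 <= start <= Length(s).

```pascal
program clustersketch;

{$mode objfpc}{$H+}

// Byte length of a UTF-8 sequence judged by its lead byte (1 for stray bytes).
function SeqLen(b: byte): SizeInt;
begin
  if b < $80 then Result := 1
  else if (b and $E0) = $C0 then Result := 2
  else if (b and $F0) = $E0 then Result := 3
  else if (b and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// True if the two bytes at p encode U+0300..U+036F ($CC $80 .. $CD $AF).
function IsBasicCombiningMark(p: PByte): boolean;
begin
  Result := ((p[0] = $CC) and (p[1] >= $80) and (p[1] <= $BF))
         or ((p[0] = $CD) and (p[1] >= $80) and (p[1] <= $AF));
end;

// Bytes used by the code point at s[start] plus the marks that follow it.
function ClusterLen(const s: string; start: SizeInt): SizeInt;
var
  len: SizeInt;
begin
  len := Length(s);
  Result := SeqLen(Ord(s[start]));
  // Clamp a truncated sequence at the end of the string
  if start + Result - 1 > len then
    Result := len - start + 1;
  // Consume 2-byte marks while both bytes are still inside the string;
  // the first code point that is not a mark ends the loop uncounted
  while (start + Result + 1 <= len) and IsBasicCombiningMark(@s[start + Result]) do
    Inc(Result, 2);
end;

const
  // 'a' + U+0301 + U+0308, then a plain 'b'
  sample: string = 'a'#$CC#$81#$CC#$88'b';

begin
  WriteLn(ClusterLen(sample, 1)); // 5: 'a' plus two 2-byte marks
  WriteLn(ClusterLen(sample, 6)); // 1: the plain 'b'
end.
```

The bound check in the while loop plays the role of MaxLookAhead here: it only limits how far the scan may go, and never adds bytes that are not marks.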

Title: Re: more UTF8 confusing
Post by: JuhaManninen on February 05, 2023, 09:56:44 am

Quote from: Bogen85 on February 05, 2023, 07:16:43 am

After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.

In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.

Quote

Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.

Why do you need a wrapper? The package .lpk file is understood only by Lazarus IDE but you should be able to use the units directly.

Title: Re: more UTF8 confusing
Post by: circular on February 05, 2023, 10:04:18 am

Quote from: JuhaManninen on February 04, 2023, 11:28:56 am

There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).

The term "character" seems ambiguous to me. In Unicode documentation, character is equivalent to code point.

At first glance, we are talking about characters of category Mark (M) and subcategory non-spacing marks (Mn). Though there are other categories of marks, for example U+20E3 which is an enclosing mark (Me): A + ⃣ = A⃣

Note that if a Unicode string starts with a Mark, then this mark is in fact a whole glyph. In this example of the enclosing mark, alone it is a rounded rectangle: ⃣

So we could call it CharacterWithMarks: the character can itself be a mark, but that's actually what we want. That is what Utf8CodePointLen does (the updated version seems to cover more cases of marks).

Title: Re: more UTF8 confusing
Post by: circular on February 05, 2023, 10:12:42 am

Quote from: JuhaManninen on February 05, 2023, 09:56:44 am

In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.

If you store strings in memory without a null char between them, you could want that. In MaxLookAhead you would supply the actual remaining length of the string. The next string could start with a mark, yet you would not want it to be included.
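
A small sketch of that situation, using only the RTL function Utf8CodePointLen. The buffer layout is an assumption made up for the example: a one-byte string 'a' packed directly in front of a second string that begins with U+0301. Per the behaviour discussed in this thread, the first call should return 1 and the second 3.

```pascal
program lookahead;

{$mode objfpc}{$H+}

const
  // Two logical strings packed back to back with no terminator:
  //   first  = 'a'                    (1 byte)
  //   second = U+0301 followed by 'b' (3 bytes)
  buf: string = 'a'#$CC#$81'b';
  firstLen = 1;

begin
  // Look-ahead limited to the first string: the mark is out of reach
  WriteLn(Utf8CodePointLen(PAnsiChar(buf), firstLen, True));
  // Look-ahead covering the whole buffer: the mark gets attached to 'a'
  WriteLn(Utf8CodePointLen(PAnsiChar(buf), Length(buf), True));
end.
```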

Quote from: Bogen85 on February 05, 2023, 07:16:43 am

After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.

I am not sure what you mean. MaxLookAhead is probably intended to be the remaining length in bytes of the string.

EDIT: never mind I thought there was a bug but there isn't

Title: Re: more UTF8 confusing
Post by: circular on February 05, 2023, 12:50:20 pm

Quote from: Bogen85 on February 05, 2023, 12:44:39 am

But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...

I'm not sure what rules pertain to them. From the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3, triple-wide; and so forth...
(But those are just the ones I use...)

Regarding ligatures, there are two different things to consider.

1. a letter may have a different shape depending on its surroundings. You can still isolate it, but if you draw it on its own, the shape will be the isolated form. To draw the correct shape, you need to add U+200D (Zero Width Joiner) before or after. For example, the letter bah ب with ligature on the left and right will be ﺒ that you can draw independently.

2. two letters can merge together. It is rare but it can happen. I gave above the example of lam ل + aleph ا that becomes a new glyph لا instead of the usual ligature ل‍ا (the two combined look like a U).

As the way letters combine depends on the context, it is not possible to write a function that, from the remaining bytes, will return the next combination of characters that corresponds to a glyph, because the preceding bytes matter. For example, if you insert U+202D (Left-to-Right Override) the Arabic letters will go from left to right, and so lam ل + aleph ا will be ال so no merging here.

So basically, you need to do a full bidirectional analysis of the text in order to know what letters can merge and where the ligatures are. Anyway, if you have bidirectional text, you would need this information to know where to display each letter. You cannot assume they will be from left to right.

Oh I forgot to mention, I just discovered that the emojis can also combine even though they are not marks. An interesting read here: https://stackoverflow.com/questions/66062139/combining-some-unicode-nonspacing-marks-with-associated-letters-for-uniform-pr

Unicode documentation on Emoji:
https://www.unicode.org/reports/tr51/

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 02:47:45 pm

Quote from: circular on February 05, 2023, 10:12:42 am

Quote from: JuhaManninen on February 05, 2023, 09:56:44 am
In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.
If you store strings in memory without a null char between them, you could want that. In MaxLookAhead you would supply the actual remaining length of the string. The next string could start with a mark, yet you would not want it to be included.

Quote from: Bogen85 on February 05, 2023, 07:16:43 am
After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.
I am not sure what you mean. MaxLookAhead is probably intended to be the remaining length in bytes of the string.

EDIT: never mind I thought there was a bug but there isn't

I'm not sure if your "never mind" is in regard to the end of the string.
Looking at the code, it does not keep going to the end of the string; it stops when a code point is not a diacritical marker, and that code point is not included in the byte count.
The diacritical markers are in specific code point ranges.

EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.

But my splitter function has a bug, it is splitting up ligature combinations, which I do need to figure out how to correct.
EDIT: Which you did comment on, and I'll need to dig into that.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 03:50:11 pm

Quote from: JuhaManninen on February 05, 2023, 09:56:44 am

Quote from: Bogen85 on February 05, 2023, 07:16:43 am
Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.
Why do you need a wrapper? The package .lpk file is understood only by Lazarus IDE but you should be able to use the units directly.

I'll create a separate topic/thread for this: https://forum.lazarus.freepascal.org/index.php/topic,62176.0.html

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 05:41:29 pm

Quote from: circular on February 05, 2023, 12:50:20 pm

Oh I forgot to mention, I just discovered that the emojis can also combine even though they are not marks. An interesting read here: https://stackoverflow.com/questions/66062139/combining-some-unicode-nonspacing-marks-with-associated-letters-for-uniform-pr
Unicode documentation on Emoji:
https://www.unicode.org/reports/tr51/

Yeah, which means that a function like my displayable character unit splitter can't be run on a source code snippet, but which is fine, one would just need to be aware such a splitter is not for all categories of UTF-8 text.

Title: Re: more UTF8 confusing
Post by: circular on February 05, 2023, 05:52:41 pm

Quote from: Bogen85 on February 05, 2023, 02:47:45 pm

EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.

It doesn't go to the end of the string if there are no more marks. Though I would suggest to set MaxLookAhead to the actual remaining bytes of the string. Otherwise it could go beyond.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 05:55:07 pm

Quote from: circular on February 05, 2023, 05:52:41 pm

Quote from: Bogen85 on February 05, 2023, 02:47:45 pm
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest to set MaxLookAhead to the actual remaining bytes of the string. Otherwise it could go beyond.

Oh yeah! Good point, I'd not thought of that. There is no reason for it to be set to anything longer anyway.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 06:11:14 pm

Quote from: Bogen85 on February 05, 2023, 05:55:07 pm

Quote from: circular on February 05, 2023, 05:52:41 pm
Quote from: Bogen85 on February 05, 2023, 02:47:45 pm
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest to set MaxLookAhead to the actual remaining bytes of the string. Otherwise it could go beyond.

Oh yeah! Good point, I'd not thought of that. There is no reason for it to be set to anything longer anyway.

Code: Pascal

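function utf8DisplayedChars (const str_in: string; const withCombiningDiacriticals: boolean = true): TStringDynArray;
  // primary walks str_in: i is the current 1-based byte index, n the number of pieces stored so far
  procedure primary (const len: integer; i: integer = 1; n: integer = 0);
    // secondary stores the next n_bytes bytes as one piece, then advances i and n
    procedure secondary (const n_bytes: integer);
      begin
        result[n] := copy(str_in, i, n_bytes);
        inc(i, n_bytes);
        inc(n);
      end;
 
    begin
      // worst case is one piece per byte; trimmed to n pieces below
      setlength(result, len);
      // (len - i) + 1 is the number of bytes from index i to the end of the string
      while i <= len do secondary(Utf8CodePointLen(@str_in[i], (len - i) + 1, withCombiningDiacriticals));
      setlength(result, n);
    end;
 
  begin
    result := default(TStringDynArray);
    primary(length(str_in));
  end;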

Utf8CodePointLen throws an exception without the + 1 following the (len - i).
String indices start at 1, not 0, so the number of bytes from index i to the end is (len - i) + 1; the + 1 is needed anyway.

Title: Re: more UTF8 confusing
Post by: Bogen85 on February 05, 2023, 07:53:31 pm

Quote from: circular on February 05, 2023, 05:52:41 pm

Quote from: Bogen85 on February 05, 2023, 02:47:45 pm
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest to set MaxLookAhead to the actual remaining bytes of the string. Otherwise it could go beyond.

However, with an ansistring it would hit the null-terminating character, which would fail the checks for being a diacritical marker. So it would return without including the null as part of the byte count.

Title: Re: more UTF8 confusing
Post by: circular on February 05, 2023, 08:05:14 pm

Indeed.