
Author Topic: more UTF8 confusing  (Read 4505 times)

circular

  • Hero Member
  • *****
  • Posts: 4356
    • Personal webpage
Re: more UTF8 confusing
« Reply #15 on: February 04, 2023, 11:30:51 pm »
In fact, parsing letters can be complicated. Glyphs are put together at different levels:
- byte (for ASCII that's sufficient)
- Unicode code point (some code points already include accents)
- "multichars": a letter with its non-spacing marks, which are zero-width Unicode characters (the list of Unicode values for non-spacing marks is not trivial and can change with new versions of Unicode)
- merge some letters together for example right-to-left Arabic letters ل and ا become not ل‍ا but لا.

So, in the end, a single glyph can be the following unicode "chars": letter + mark + letter + mark.
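
To make the first two levels concrete, here is a minimal sketch that counts bytes and code points for the same visible letter written two ways. The helper name CountCodePoints is hypothetical (it is not part of any library) and it assumes the input is valid UTF-8; it relies only on the rule that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```pascal
program LayersDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (10xxxxxx). Assumes the input is valid UTF-8.
function CountCodePoints(const S: AnsiString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

const
  Precomposed = #$C3#$A9;     // U+00E9: e with acute as a single code point
  Decomposed  = 'e'#$CC#$81;  // U+0065 followed by U+0301 (combining acute)
begin
  WriteLn(Length(Precomposed), ' bytes, ', CountCodePoints(Precomposed), ' code point(s)');
  WriteLn(Length(Decomposed), ' bytes, ', CountCodePoints(Decomposed), ' code point(s)');
end.
```

Both strings are meant to display as one glyph, yet the first is 2 bytes and 1 code point while the second is 3 bytes and 2 code points. That gap is what the "multichars" level has to bridge.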

I've made an implementation of that in the TGlyphCursorUtf8 class of BGRAUTF8 unit. There are still things it doesn't handle, for example I know there are some Indian letters that are supposed to merge together.

More explanations there:
https://forum.lazarus.freepascal.org/index.php/topic,49750.msg361630.html#msg361630
Conscience is the debugger of the mind

Bogen85

  • Hero Member
  • *****
  • Posts: 685
« Reply #16 on: February 05, 2023, 12:44:39 am »
Utf8CodePointLen and its max look ahead are very poorly named.

The max look ahead is not a max... It is more of a max allowed for diacritical markers that need to be combined with the base code point.

And it is not a single code point size...

It is the size of the base code point in bytes plus all the sizes of the diacritical marker points to be combined into the final displayed character.

To be honest, I stumbled across OP lazer's solution by accident; that was not the problem I was originally trying to solve for myself. I just wanted to use a function that was not part of Lazarus, saw the IncludeCombiningDiacriticalMarks parameter, looked those up as I'd never heard of them (I mean in the Unicode sense, of course I'd seen them on character glyphs), and fiddled with it until I got them working by setting MaxLookAhead high enough....

I was just looking for a function to return the number of bytes in the code point, and nothing more...

Quote
There should be another function called Utf8CharacterLen or similar which also includes combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).

Agreed....

But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...

I'm not sure what rules pertain to them. From the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3, triple-wide; and so forth...
(But those are just the ones I use...)

My programs are not directly GUI related (currently), so I'm relying on my text editor (CudaText), terminal (WezTerm), and various web browsers to render the displayed Unicode characters correctly...
Web browsers seem to get things right more often than my text editor and terminal...
« Last Edit: February 05, 2023, 01:02:57 am by Bogen85 »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4541
  • I like bugs.
« Reply #17 on: February 05, 2023, 12:59:02 am »
Quote
Alright, I tried UTF8CodepointSize from lazutf8,
...
It does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.
It returns the number of bytes in one codepoint as the function name suggests. The following diacritical markers are also codepoints. IMO a function counting them all should be named differently.

Quote
If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.
Indeed, the LazUtf8 unit does not have such a function now. Using Utf8CodePointLen is a good idea.
I recommended LazUtils package in general. It has many things, not only Unicode stuff.
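
The naming suggestion above could be sketched as a thin wrapper. This is only an illustration: Utf8CharacterLen here is a hypothetical name taken from the suggestion in this thread, not an existing LazUtf8 or RTL function, and the sketch assumes an FPC version whose System unit provides Utf8CodePointLen with the three parameters discussed here.

```pascal
program CharLenDemo;
{$mode objfpc}{$H+}

// Hypothetical wrapper, named after the suggestion in this thread. It is not
// part of LazUtf8 or the RTL; it only delegates to the System unit's
// Utf8CodePointLen with IncludeCombiningDiacriticalMarks = True.
function Utf8CharacterLen(P: PAnsiChar; RemainingBytes: SizeInt): SizeInt;
begin
  Result := Utf8CodePointLen(P, RemainingBytes, True);
end;

var
  S: AnsiString;
begin
  S := 'e'#$CC#$81'x';  // e + U+0301 (combining acute) + x
  // Length of the first code point only (the base letter).
  WriteLn('code point only: ', Utf8CodePointLen(PAnsiChar(S), Length(S), False));
  // Length of the base letter plus the combining marks that follow it.
  WriteLn('with marks:      ', Utf8CharacterLen(PAnsiChar(S), Length(S)));
end.
```

If the function behaves as described in this thread, the first call reports only the base letter and the second reports the base letter together with its mark.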
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bogen85

« Reply #18 on: February 05, 2023, 01:11:55 am »
Quote
Indeed, the LazUtf8 unit does not have such a function now. Using Utf8CodePointLen is a good idea.

Yeah, using it with the understanding that it is not named correctly and that the max look ahead parameter also has a misleading name. I should look at the code for it...

Quote
I recommended LazUtils package in general. It has many things, not only Unicode stuff.

Yeah, there were several things here I was not using that I should be:
https://wiki.freepascal.org/LazUtils_Documentation_Roadmap

Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.
« Last Edit: February 05, 2023, 02:32:54 am by Bogen85 »

Bogen85

« Reply #19 on: February 05, 2023, 07:16:43 am »
Utf8CodePointLen was updated recently and the logic is straightforward.

https://gitlab.com/freepascal.org/fpc/source/-/blob/b38d13577f94364b4c7ba6f4d6b032eae404e934/rtl/inc/generic.inc#L1147

There is no penalty for setting MaxLookAhead too high.

After the first code point it collects additional code points, but only if they are diacritical markers. As soon as it hits a code point that is not a diacritical marker, it stops, and only the preceding diacritical marker byte counts are added to the total.

After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.

Only the name Utf8CodePointLen is misleading, and only when IncludeCombiningDiacriticalMarks is enabled.

When IncludeCombiningDiacriticalMarks is enabled, the name of the function is effectively something like CombinedUtf8CodePointLengthsOfBaseCodePointAndDiacriticalMarks.
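
The look-ahead behaviour described in this post can be illustrated with a simplified sketch. This is not the RTL code: the names BaseLen, IsMarkAt and CombinedLenSketch are hypothetical, only the two-byte marks U+0300..U+036F (UTF-8 $CC $80 .. $CD $AF) are recognised, and the input is assumed to be valid, null-terminated UTF-8 whose first code point fits inside the look-ahead window.

```pascal
program LookAheadSketch;
{$mode objfpc}{$H+}

// Byte length of the code point starting at P, judged from its lead byte.
// Simplification: anything unexpected is treated as a single byte.
function BaseLen(P: PAnsiChar): SizeInt;
begin
  case Ord(P^) of
    $00..$7F: Result := 1;
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F7: Result := 4;
  else
    Result := 1;
  end;
end;

// True if a mark in U+0300..U+036F starts at P.
function IsMarkAt(P: PAnsiChar): Boolean;
begin
  Result := ((P[0] = #$CC) and (P[1] in [#$80..#$BF]))
         or ((P[0] = #$CD) and (P[1] in [#$80..#$AF]));
end;

// Base code point length plus the lengths of the marks that directly follow.
function CombinedLenSketch(P: PAnsiChar; MaxLookAhead: SizeInt): SizeInt;
begin
  Result := BaseLen(P);
  // Consume marks while a whole two-byte mark still fits in the window.
  // The loop ends at the first code point that is not a mark, so a large
  // MaxLookAhead costs nothing extra.
  while (Result + 2 <= MaxLookAhead) and IsMarkAt(P + Result) do
    Inc(Result, 2);
end;

var
  S: AnsiString;
begin
  S := 'e'#$CC#$81#$CC#$88'x';  // e + U+0301 + U+0308 + x
  WriteLn(CombinedLenSketch(PAnsiChar(S), Length(S)));  // window covers the whole string
  WriteLn(CombinedLenSketch(PAnsiChar(S), 3));          // window ends after the first mark
end.
```

With this sample string the first call gives 5 (one byte for e plus two marks of two bytes each) and the second gives 3, because the window cuts off the second mark. The 'x' is never counted in either case.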

JuhaManninen

« Reply #20 on: February 05, 2023, 09:56:44 am »
Quote
After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.
In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.

Quote
Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.
Why do you need a wrapper? The package .lpk file is understood only by Lazarus IDE but you should be able to use the units directly.

circular

« Reply #21 on: February 05, 2023, 10:04:18 am »
Quote
There should be another function called Utf8CharacterLen or similar which also includes combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).
The term "character" seems ambiguous to me. In Unicode documentation, character is equivalent to code point.

At first glance, we are talking about characters of category Mark (M) and subcategory non-spacing marks (Mn). Though there are other categories of marks, for example U+20E3 which is an enclosing mark (Me): A + ​⃣ = A⃣

Note that if a Unicode string starts with a Mark, then this mark is in fact a whole glyph. In this example of the enclosing mark, alone it is a rounded rectangle: ​⃣

So we could call it CharacterWithMarks: the character can itself be a mark, but that's actually what we want. That's what Utf8CodePointLen does (the updated version seems to cover more kinds of marks).

circular

« Reply #22 on: February 05, 2023, 10:12:42 am »
Quote
In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.
If you store strings in memory without a null char between them, you could want that. In MaxLookAhead you would supply the actual remaining length of the string. The next string could start with a mark, yet you would not want it to be included.

Quote
After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.
I am not sure what you mean. MaxLookAhead is probably intended to be the remaining length in bytes of the string.

EDIT: never mind I thought there was a bug but there isn't
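
The packed-strings case could look like the following minimal sketch. The buffer layout is an assumption made up for illustration (two logical strings stored back to back, the second starting with a combining mark), and the sketch assumes an FPC version whose System unit provides Utf8CodePointLen.

```pascal
program AdjacentStrings;
{$mode objfpc}{$H+}

var
  Buf: AnsiString;
begin
  // Two logical strings packed back to back with no separator:
  // the first is 'e' (1 byte); the second starts with U+0301 ($CC $81).
  Buf := 'e'#$CC#$81'x';
  // Look-ahead limited to the first string's own remaining length (1 byte).
  WriteLn(Utf8CodePointLen(PAnsiChar(Buf), 1, True));
  // Look-ahead covering the whole buffer, as if it were one string.
  WriteLn(Utf8CodePointLen(PAnsiChar(Buf), Length(Buf), True));
end.
```

The point of the first call is that the mark belonging to the next string stays outside the window, so it cannot be attached to the 'e'.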
« Last Edit: February 05, 2023, 10:15:03 am by circular »

circular

« Reply #23 on: February 05, 2023, 12:50:20 pm »
Quote
But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...

I'm not sure what rules pertain to them. From the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3, triple-wide; and so forth...
(But those are just the ones I use...)
Regarding ligatures, there are two different things to consider.

1. a letter may have a different shape depending on its surroundings. You can still isolate it, but if you draw it on its own, the shape will be the isolated form. To draw the correct shape, you need to add U+200D (Zero Width Joiner) before or after. For example, the letter bah ب with ligature on the left and right will be ﺒ that you can draw independently.

2. two letters can merge together. It is rare but it can happen. I gave above the example of lam ل + aleph ا that becomes a new glyph لا instead of the usual ligature ل‍ا (the two combined look like a U).

As the way letters combine depends on the context, it is not possible to write a function that, from the remaining bytes alone, returns the next combination of characters that corresponds to a glyph, because the preceding bytes matter. For example, if you insert U+202D (Left-to-Right Override), the Arabic letters will go from left to right, and so lam ل + aleph ا will be ال, so no merging here.

So basically, you need to do a full bidirectional analysis of the text in order to know which letters can merge and where the ligatures are. Anyway, if you have bidirectional text, you would need this information to know where to display each letter. You cannot assume they will go from left to right.

Oh I forgot to mention, I just discovered that the emojis can also combine even though they are not marks. An interesting read here: https://stackoverflow.com/questions/66062139/combining-some-unicode-nonspacing-marks-with-associated-letters-for-uniform-pr

Unicode documentation on Emoji:
https://www.unicode.org/reports/tr51/
« Last Edit: February 05, 2023, 01:08:29 pm by circular »

Bogen85

« Reply #24 on: February 05, 2023, 02:47:45 pm »
Quote
In what situation would you want only part of the diacritical markers? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical markers. Now a user must guess how many legal markers a character might have and provide that number as MaxLookAhead. Nonsense, the code should handle such validity checks by itself. Just poor design IMO, or else there is a use case I don't understand.
If you store strings in memory without a null char between them, you could want that. In MaxLookAhead you would supply the actual remaining length of the string. The next string could start with a mark, yet you would not want it to be included.

After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look ahead, but only when valid diacritical marker code point byte counts are being consumed.
I am not sure what you mean. MaxLookAhead is probably intended to be the remaining length in bytes of the string.

EDIT: never mind I thought there was a bug but there isn't

I'm not sure if your "never mind" is in regard to the end of the string.
Looking at the code, it does not keep going to the end of the string; it stops when a code point is not a diacritical marker, and that code point is not included in the byte count.
The diacritical markers are in specific code point ranges.

EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.

But my splitter function has a bug, it is splitting up ligature combinations, which I do need to figure out how to correct.
EDIT: Which you did comment on, and I'll need to dig into that.

« Last Edit: February 05, 2023, 05:37:31 pm by Bogen85 »

Bogen85

« Reply #25 on: February 05, 2023, 03:50:11 pm »
Quote
Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.
Why do you need a wrapper? The package .lpk file is understood only by Lazarus IDE but you should be able to use the units directly.

I'll create a separate topic/thread for this: https://forum.lazarus.freepascal.org/index.php/topic,62176.0.html

Bogen85

« Reply #26 on: February 05, 2023, 05:41:29 pm »
Quote
Oh I forgot to mention, I just discovered that the emojis can also combine even though they are not marks. An interesting read here: https://stackoverflow.com/questions/66062139/combining-some-unicode-nonspacing-marks-with-associated-letters-for-uniform-pr
Unicode documentation on Emoji:
https://www.unicode.org/reports/tr51/

Yeah, which means that a function like my displayable character unit splitter can't be run on a source code snippet, but that is fine; one just needs to be aware that such a splitter is not for all categories of UTF-8 text.

circular

« Reply #27 on: February 05, 2023, 05:52:41 pm »
Quote
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest setting MaxLookAhead to the actual number of remaining bytes in the string. Otherwise it could read beyond the end of the string.

Bogen85

« Reply #28 on: February 05, 2023, 05:55:07 pm »
Quote
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest setting MaxLookAhead to the actual number of remaining bytes in the string. Otherwise it could read beyond the end of the string.

Oh yeah! Good point, I'd not thought of that. There is no reason for it to be set to anything longer anyways.

Bogen85

« Reply #29 on: February 05, 2023, 06:11:14 pm »
Quote
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.
It doesn't go to the end of the string if there are no more marks. Though I would suggest setting MaxLookAhead to the actual number of remaining bytes in the string. Otherwise it could read beyond the end of the string.

Oh yeah! Good point, I'd not thought of that. There is no reason for it to be set to anything longer anyways.


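Code: Pascal
  // Requires the Types unit for TStringDynArray.
  function utf8DisplayedChars (const str_in: string; const withCombiningDiacriticals: boolean = true): TStringDynArray;
    procedure primary (const len: integer; i: integer = 1; n: integer = 0);
      // Copy the next n_bytes bytes into result[n], then advance both indices.
      procedure secondary (n_bytes: integer);
        begin
          // Guard: if no positive length comes back (invalid or incomplete
          // input), consume one byte so the loop always advances.
          if n_bytes <= 0 then n_bytes := 1;
          result[n] := copy(str_in, i, n_bytes);
          inc(i, n_bytes);
          inc(n);
        end;

      begin
        setlength(result, len);  // upper bound: at most one element per byte
        // (len - i) + 1 is the number of bytes remaining from index i.
        while i <= len do secondary(Utf8CodePointLen(@str_in[i], (len - i) + 1, withCombiningDiacriticals));
        setlength(result, n);    // trim to the number of elements produced
      end;

    begin
      result := default(TStringDynArray);
      primary(length(str_in));
    end;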

Utf8CodePointLen throws an exception without the + 1 following the (len - i).
String indices start at 1, not 0, so the + 1 should be needed anyways.

 
