@Bogen85: https://wiki.freepascal.org/Unicode_Support_in_Lazarus#CodePoint_functions_for_encoding_agnostic_code
> It is confusing to me that Lazarus and Free Pascal both have units that provide similar functionality.

Yes, the reason is that the Lazarus UTF-8 solution was made before FPC had such library functions.
I know OP is expressly using Lazarus, but many Free Pascal programs not using Lazarus units need to do similar things with UTF8.
So duplicate functionality exists, but with different function names and parameters...
> It is very unfortunate that this was not done in a coordinated way between FPC and Lazarus.

True.
Anyway, it seems to be doing what I need now, so huge thanks for that code. It's insane that it's that complicated but at least I have a solution and have learnt a few new tricks with fpc.
Wow, I had no idea of the viper's nest I was walking into, just wanting a little twiddly bit on the bottom of the letter c!!!
Many thanks to Bogen85 for that full and explicit code sample. I would never have got to that. I'm not even sure I understand the syntax of that procedure-in-procedure-in-function thing. I never knew that was possible!
Ideally Lazarus should now start to use the FPC library funcs, but they are done in a very different way.
I looked at :
function Utf8CodePointLen(P: PAnsiChar; MaxLookAhead: SizeInt; IncludeCombiningDiacriticalMarks: Boolean): SizeInt;
It is similar to the function UTF8CodepointSize in unit LazUTF8 in package LazUtils. However, it has a parameter MaxLookAhead which is used only for checking validity. Why should a user provide such a value? At least it should have a default value.
The parameter IncludeCombiningDiacriticalMarks is wrong in a function called Utf8CodePointLen. A combining diacritical mark is another CodePoint, yet the function name suggests that the length of only one codepoint is returned.
There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).
The LazUtils function UTF8CodepointSize has one parameter and is well optimized. I would not switch it to the FPC's function now.
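For readers comparing the two, the lead-byte logic such a codepoint-size function is built on can be sketched roughly like this (a minimal sketch only; the function name and the fallback for invalid lead bytes are mine, not the actual LazUtils or FPC implementation):

```pascal
program CpSizeSketch;
{$mode objfpc}{$H+}

// Sketch: size in bytes of one UTF-8 codepoint, decided by its lead byte.
// This mirrors the idea behind LazUtils' UTF8CodepointSize; names are my own.
function CodepointSize(P: PAnsiChar): Integer;
begin
  case Ord(P^) of
    $00..$7F: Result := 1;   // ASCII
    $C2..$DF: Result := 2;   // 2-byte sequence
    $E0..$EF: Result := 3;   // 3-byte sequence
    $F0..$F4: Result := 4;   // 4-byte sequence
  else
    Result := 1;             // invalid lead byte: treat as a single byte
  end;
end;

var
  S: AnsiString;
begin
  // 'a' (1 byte) + U+00E7 'ç' (2 bytes) + U+20AC '€' (3 bytes)
  S := 'a' + #$C3#$A7 + #$E2#$82#$AC;
  WriteLn(CodepointSize(@S[1]));  // 1
  WriteLn(CodepointSize(@S[2]));  // 2
  WriteLn(CodepointSize(@S[4]));  // 3
end.
```

Note that nothing here needs a MaxLookAhead: the size comes from the lead byte alone, which is why a one-parameter function like UTF8CodepointSize can stay so small.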
BTW, the LazUtils package can be used also in console programs. Recommended.
> I tend to not use LCL units as I get a lot of warnings/hints from using them (which I always have set to be errors in my compile flags), so to not have to fiddle with using LCL units in a "special" manner I just avoid them and stick with FCL (as I don't get those kinds of warnings/hints from them) and my own units.

LazUtils does not depend on the LCL.
> Alright, I tried UTF8CodepointSize from lazutf8.

It returns the number of bytes in one codepoint, as the function name suggests. The following diacritical marks are also codepoints. IMO a function counting them all should be named differently.
...
It does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.
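To make the distinction concrete, here is a rough sketch of a length function that also absorbs trailing combining marks (restricted to the U+0300..U+036F block for brevity; the names and the exact checks are mine, not the actual library code):

```pascal
program CharLenSketch;
{$mode objfpc}{$H+}

// Size of one codepoint from its UTF-8 lead byte (sketch).
function LeadSize(B: Byte): Integer;
begin
  case B of
    $00..$7F: Result := 1;
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 1;
  end;
end;

// U+0300..U+036F encode as $CC $80..$CD $AF in UTF-8.
// Real code would also cover the other combining blocks.
function IsCombining(P: PAnsiChar): Boolean;
begin
  Result := (Ord(P[0]) = $CC) or
            ((Ord(P[0]) = $CD) and (Ord(P[1]) <= $AF));
end;

// Length of a codepoint PLUS its trailing combining marks,
// never reading past Remaining bytes.
function CharacterLen(P: PAnsiChar; Remaining: SizeInt): SizeInt;
begin
  Result := LeadSize(Ord(P^));
  while (Result + 2 <= Remaining) and IsCombining(P + Result) do
    Inc(Result, 2);  // marks in this block are 2 bytes each
end;

var
  S: AnsiString;
begin
  S := 'e' + #$CC#$81;  // 'e' + U+0301 COMBINING ACUTE ACCENT
  WriteLn(CharacterLen(@S[1], Length(S)));  // 3: base char + one mark
end.
```

A plain codepoint-size function would report 1 for the same input, since the 'e' really is one codepoint of one byte; both answers are "correct", they just answer different questions.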
> If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.

Indeed, the LazUtf8 unit does not have such a function now. Using Utf8CodePointLen is a good idea.
I recommended LazUtils package in general. It has many things, not only Unicode stuff.
> After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look-ahead, but only when valid diacritical mark codepoint byte counts are being consumed.

In what situation would you want only part of the diacritical marks? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical marks. Now a user must guess how many legal marks a character might have and provide that number as MaxLookAhead. Nonsense; the code should handle such validity checks by itself. Just poor design IMO, or there is a use case I don't understand.
> Now that I found I can wrap their use in my own projects so I don't have to back off on my compiler flags for my own units, I plan to start using them more.

Why do you need a wrapper? The package .lpk file is understood only by the Lazarus IDE, but you should be able to use the units directly.
> There should be another function called Utf8CharacterLen or similar which also returns combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).

The term "character" seems ambiguous to me. In Unicode documentation, character is equivalent to code point.
> In what situation would you want only part of the diacritical marks? The only hypothetical case is an invalid UTF-8 string with garbage that looks like diacritical marks. Now a user must guess how many legal marks a character might have and provide that number as MaxLookAhead. Nonsense; the code should handle such validity checks by itself. Just poor design IMO, or there is a use case I don't understand.

If you store strings in memory without a null char between them, you could want that. In MaxLookAhead you would supply the actual remaining length of the string. The next string could start with a mark, yet you would not want it to be included.
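That use case can be shown with a tiny sketch: two strings stored back to back with no terminator, where the second one happens to begin with a combining mark that must not be folded into the last character of the first. All names here are mine, not FPC's:

```pascal
program LookAheadDemo;
{$mode objfpc}{$H+}

// Count trailing combining marks (U+0300..U+036F, 2 bytes each in UTF-8),
// but never read past MaxLookAhead bytes. The $CC/$CD lead-byte check is
// deliberately loose here; real code would verify the exact ranges.
function MarkBytes(P: PAnsiChar; MaxLookAhead: SizeInt): SizeInt;
begin
  Result := 0;
  while (Result + 2 <= MaxLookAhead) and
        (Ord(P[Result]) in [$CC, $CD]) do
    Inc(Result, 2);
end;

var
  Buf: AnsiString;
begin
  // first string: 'e' (1 byte); second string: U+0301 followed by 'x'
  Buf := 'e' + #$CC#$81 + 'x';
  // 0 bytes of the first string remain after the 'e', so no marks belong to it
  WriteLn(MarkBytes(@Buf[2], 0));               // 0
  // with the whole buffer as look-ahead, the next string's mark is consumed
  WriteLn(MarkBytes(@Buf[2], Length(Buf) - 1)); // 2
end.
```

So MaxLookAhead acts as a hard boundary: pass the remaining byte count of the current string and the scan cannot leak into the neighbour.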
> After looking at the code it appears that MaxLookAhead is not named incorrectly. It really is the max look-ahead, but only when valid diacritical mark codepoint byte counts are being consumed.

I am not sure what you mean. MaxLookAhead is probably intended to be the remaining length in bytes of the string.
> But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...

Regarding ligatures, there are two different things to consider.
I'm not sure what rules pertain to them. With the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3 make a triple-wide, and so forth...
(But those are just the ones I use...)
EDIT: never mind I thought there was a bug but there isn't
Oh, I forgot to mention: I just discovered that emojis can also combine even though they are not marks. An interesting read here: https://stackoverflow.com/questions/66062139/combining-some-unicode-nonspacing-marks-with-associated-letters-for-uniform-pr
Unicode documentation on Emoji:
https://www.unicode.org/reports/tr51/
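The joining happens through U+200D ZERO WIDTH JOINER, which is a separate codepoint (bytes $E2 $80 $8D in UTF-8), not a combining mark, so a mark-based splitter will not see it. A small sketch (names are mine) that just counts the joiners in a sequence:

```pascal
program ZwjDemo;
{$mode objfpc}{$H+}

// Count occurrences of U+200D ZERO WIDTH JOINER ($E2 $80 $8D) in a
// UTF-8 string. Emoji ZWJ sequences render as one glyph but contain
// several full codepoints separated by these joiners.
function CountZwj(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) - 2 do
    if (Ord(S[I]) = $E2) and (Ord(S[I+1]) = $80) and (Ord(S[I+2]) = $8D) then
      Inc(Result);
end;

var
  Pair: AnsiString;
begin
  // U+1F468 (man) + ZWJ + U+1F469 (woman): typically rendered as one glyph
  Pair := #$F0#$9F#$91#$A8 + #$E2#$80#$8D + #$F0#$9F#$91#$A9;
  WriteLn(CountZwj(Pair));  // 1
  WriteLn(Length(Pair));    // 11 bytes for a single visible "character"
end.
```

A full grapheme-cluster splitter would have to treat "codepoint + ZWJ + codepoint" as one unit, on top of the combining-mark handling discussed above.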
> EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.

It doesn't go to the end of the string if there are no more marks. Though I would suggest setting MaxLookAhead to the actual remaining bytes of the string; otherwise it could read beyond.
Oh yeah! Good point, I'd not thought of that. There is no reason for it to be set to anything longer anyway.
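The resulting splitter loop then looks roughly like this (a sketch under my own names; CharLen stands in for whichever codepoint-plus-marks length function you actually call, e.g. FPC's Utf8CodePointLen, and only the U+0300..U+036F marks are handled here):

```pascal
program SplitterSketch;
{$mode objfpc}{$H+}

function LeadSize(B: Byte): Integer;
begin
  case B of
    $00..$7F: Result := 1;
    $C2..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F4: Result := 4;
  else
    Result := 1;
  end;
end;

// Codepoint length plus trailing combining marks, bounded by MaxLookAhead.
function CharLen(P: PAnsiChar; MaxLookAhead: SizeInt): SizeInt;
begin
  Result := LeadSize(Ord(P^));
  while (Result + 2 <= MaxLookAhead) and (Ord(P[Result]) in [$CC, $CD]) do
    Inc(Result, 2);
end;

var
  S, Part: AnsiString;
  I, N: SizeInt;
begin
  S := 'c' + #$CC#$A7 + 'a';  // 'c' + U+0327 COMBINING CEDILLA, then 'a'
  I := 1;
  while I <= Length(S) do
  begin
    // pass the actual remaining byte count, not MaxInt
    N := CharLen(@S[I], Length(S) - I + 1);
    Part := Copy(S, I, N);
    WriteLn(Length(Part));  // bytes in this displayable character
    Inc(I, N);
  end;
end.
```

With the remaining-bytes limit, the loop can never scan past the end of the string even if the buffer beyond it happens to contain mark-like bytes.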
EDIT: If it just went to the end of the string, then my utf8 string displayable character "splitter" function would not work, as I'm setting the max look ahead to maxint.It doesn't go to the end of the string if there are no more marks. Though I would suggest to set MaxLookAhead to the actual remaining bytes of the string. Otherwise it could go beyond.