Utf8CodePointLen and its max look ahead are very poorly named.
The max look ahead is not a max... It is more of a max allowed for diacritical markers that need to be combined with the base code point.
And it is not a single code point size...
It is the size of the base code point in bytes plus all the sizes of the diacritical marker points to be combined into the final displayed character.
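To make that concrete, here is a rough sketch in Python (not the Lazarus code, and not how Utf8CodePointLen is actually implemented). The name cluster_byte_len and its max_look_ahead parameter are made up for illustration. It assumes valid UTF-8 that starts on a code point boundary, and it only counts marks with a non-zero canonical combining class, so it is an approximation of "base code point plus its combining marks":

```python
import unicodedata


def cluster_byte_len(data: bytes, start: int = 0, max_look_ahead: int = 16) -> int:
    """Bytes used by the code point at `start` plus the combining marks after it.

    Hypothetical helper: only `max_look_ahead` bytes are inspected, so marks
    beyond that window are not counted.
    """
    window = data[start:start + max_look_ahead]
    # A multi-byte sequence cut off by the window is dropped by errors="ignore".
    text = window.decode("utf-8", errors="ignore")
    if not text:
        return 0
    total = len(text[0].encode("utf-8"))  # size of the base code point
    for ch in text[1:]:
        if unicodedata.combining(ch) == 0:
            break  # next base code point reached
        total += len(ch.encode("utf-8"))  # add the size of each combining mark
    return total


sample = "e\u0301\u0323x".encode("utf-8")  # e + acute accent + dot below + x
print(cluster_byte_len(sample))                    # whole cluster
print(cluster_byte_len(sample, max_look_ahead=3))  # window too small for 2nd mark
```

With a small look ahead the second mark is simply never seen, which is why the value had to be set "high enough" before the marks were included.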
To be honest, I stumbled across OP lazer's solution by accident; that was not the problem I was originally trying to solve. I just wanted to use a function that was not part of Lazarus, saw the IncludeCombiningDiacriticalMarks parameter, looked those up as I'd never heard of them (in the Unicode sense, I mean; of course I'd seen them on character glyphs), and fiddled with it until I got them working by setting MaxLookAhead high enough....
I was just looking for a function to return the number of bytes in the code point, and nothing more...
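For what it's worth, the "bytes in the code point and nothing more" case only needs the lead byte. A minimal sketch in Python (the name code_point_size is hypothetical, and it only classifies the lead byte without validating the continuation bytes that follow):

```python
def code_point_size(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`.

    Returns 0 for a continuation byte or a byte that cannot start a sequence.
    """
    if lead < 0x80:
        return 1  # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 0x80..0xC1 and 0xF5..0xFF are not valid lead bytes


for ch in ("a", "é", "€", "😀"):
    print(ch, code_point_size(ch.encode("utf-8")[0]))
```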
There should be another function called Utf8CharacterLen or similar which returns also combining diacritical marks. Yes, the term "character" can have many meanings in Unicode but this may be the best meaning (a codepoint + its combining diacritical marks).
Agreed....
But in some of the things that @circular points out, it gets even more complicated in regard to ligatures...
I'm not sure what rules pertain to them. From the ones I use, if 2 characters are combined to make a ligature, the result is a double-wide character; 3, a triple-wide one; and so forth...
(But those are just the ones I use...)
My programs are not directly GUI related (currently), so I'm relying on my text editor (CudaText), terminal (WezTerm), and various web browsers to render the displayed Unicode characters correctly...
Web browsers seem to get things right more often than my text editor and terminal...