Alright, I tried UTF8CodepointSize from lazutf8. (I had to create a wrapper unit, which I don't need to do with any unit I use from the FPC install, but that is another issue. Maybe I'm doing something wrong in how I'm specifying where to get the Lazarus units from, which are in sub-directories of the directory that Lazarus is installed in.)
// lazutf8.pp
{$push}
{$warnings off}
{$hints off}
{$notes off}
{$include lazutf8.pas}
{$pop}
UTF8CodepointSize does not return the correct number of bytes when an ASCII character is followed by diacritical markers. It returns 1.
Utf8CodePointLen, which I'm using, does return the correct number of bytes (the ASCII character, 1 byte, plus the bytes for each diacritical marker).
UTF8CodepointSize does return the correct number of bytes for multi-byte Unicode codepoints, but since it does not check for diacritical markers, it won't tell you how many bytes make up what ends up being a single display character (which can be more than 4 bytes).
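To make the difference concrete, here is a rough sketch of the byte count I'm after. It is not how lazutf8 or Utf8CodePointLen work internally. The names CodepointBytes, IsCombiningMark and DisplayCharBytes are made up for this example, it assumes well-formed, null-terminated UTF-8, and it only treats U+0300..U+036F as diacritical markers (other combining blocks are ignored).

```pascal
program DisplayCharBytesDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of one UTF-8 codepoint, judged from the
// lead byte only. No validation; assumes well-formed UTF-8.
function CodepointBytes(p: PChar): Integer;
begin
  case Ord(p^) of
    $C0..$DF: Result := 2;
    $E0..$EF: Result := 3;
    $F0..$F7: Result := 4;
  else
    Result := 1; // ASCII, or a stray byte treated as a single unit
  end;
end;

// True if p points at a codepoint in U+0300..U+036F, which encodes as
// CC 80..CC BF and CD 80..CD AF. Other combining blocks are ignored here.
function IsCombiningMark(p: PChar): Boolean;
begin
  Result := ((p[0] = #$CC) and (p[1] in [#$80..#$BF])) or
            ((p[0] = #$CD) and (p[1] in [#$80..#$AF]));
end;

// Bytes in the base codepoint plus any diacritical markers that follow it.
function DisplayCharBytes(p: PChar): Integer;
begin
  Result := 0;
  if (p = nil) or (p^ = #0) then
    Exit;
  Result := CodepointBytes(p);
  while IsCombiningMark(p + Result) do
    Inc(Result, 2); // every marker in this block is two bytes long
end;

var
  s: AnsiString;
begin
  // 'e', U+0301 (combining acute), U+0323 (combining dot below), then 'x'
  s := 'e' + #$CC#$81 + #$CC#$A3 + 'x';
  WriteLn(CodepointBytes(PChar(s)));   // prints 1: the codepoint only
  WriteLn(DisplayCharBytes(PChar(s))); // prints 5: 'e' plus both markers
end.
```

The first number is the kind of result I get from the lazutf8 functions for this input; the second is what I need.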
I don't see anything in lazutf8 that works.
I tried all of these:
UTF8CodepointSize(pChar(str)), // had to disable note for declared inline but not inlined
UTF8CharacterLength(pChar(str)), // deprecated, but I tried it anyways
UTF8CodepointStrictSize(pChar(str)),
UTF8CharacterStrictLength(pChar(str)), // deprecated, but I tried it anyways
UTF8Length(pChar(str)),
UTF8LengthFast(pChar(str)),
They all come up short as far as number of bytes in the displayed character when trailing diacritical markers are present.
So what from lazutils (or lazutf8) can be used to get the number of bytes in the string that are for the displayed character?
If there are none, I'll need to continue using Utf8CodePointLen like I'm already doing, even if the functions in lazutf8 and lazutils are the recommended ones.