Pascal's own UTF8String is MUCH faster. This code would only help for C-style strings that depend on the stupid strlen function. (Yes, even if Pascal's length() is used, it calls out to strlen for C strings and so runs in O(n), not in O(1) as it does for UTF8String, which stores its length at a negative offset.) It is silly code made up for C programmers. Also, the tests should test border cases, not normal cases. Otherwise they give a false sense of proof.
Thaddy, you have serious gaps in your knowledge about Unicode. Please learn its concepts first.
The proposed function, UTF8Len, counts codepoints just like UTF8Length, but faster. UTF8String does not store the count of codepoints at a negative offset or anywhere else; what it stores there is the length in bytes. Counting codepoints requires iterating over the string. Please look at how UTF8Length does it.
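To make the difference between byte length and codepoint count concrete, here is a minimal sketch in C. It is not the code of UTF8Len or UTF8Length; the name count_codepoints is made up for illustration, and it assumes the input is valid UTF-8 with a known byte length (as a Pascal string has).

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Every codepoint contains exactly one byte that is NOT a continuation
   byte (10xxxxxx), so counting those bytes counts codepoints.
   Assumes valid UTF-8; nothing is validated here. */
size_t count_codepoints(const char *s, size_t byte_len)
{
    size_t count = 0;
    for (size_t i = 0; i < byte_len; i++) {
        unsigned char b = (unsigned char)s[i];
        if ((b & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}
```

For the 6-byte string "a", "ä", "€" (1 + 2 + 3 bytes) the stored length is 6, but the loop has to visit every byte to find out that there are 3 codepoints.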
Another issue is that counting the codepoints of a long string is not very useful by itself. The author of "Even faster UTF-8 character counting" makes a valid comment at the end:
"Well, the first rule of optimization is to start by finding a good algorithm -- and any algorithm in which the critical path involves counting UTF-8 characters in a 32 megabyte NUL-terminated string is doing something wrong. This is very much a toy problem; but the lesson it teaches is worth remembering: Vectorization is good!"
He uses the term "UTF-8 characters" where he should say "UTF-8 codepoints", but that is a small issue.
He also presents a further idea, using SSE2 on modern Intel CPUs, which could allow a super-super fast implementation.
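As a rough idea of what "vectorization" means here, the following C sketch processes 8 bytes per step using plain 64-bit integer arithmetic. It is neither the article's code nor SSE2, only an illustration of the general word-at-a-time technique. The name count_codepoints_swar is made up, and the sketch assumes valid UTF-8 and a known byte length.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Counts continuation bytes 8 at a time and
   subtracts them from the byte length. Assumes valid UTF-8. */
size_t count_codepoints_swar(const char *s, size_t byte_len)
{
    const uint64_t ones = 0x0101010101010101ULL;
    size_t continuation = 0;
    size_t i = 0;

    for (; i + 8 <= byte_len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);                 /* safe unaligned load */
        /* The low bit of each byte becomes 1 where bit 7 is set and
           bit 6 is clear, i.e. where the byte is 10xxxxxx. */
        uint64_t m = (w >> 7) & (~w >> 6) & ones;
        /* The multiplication sums the 8 flags into the top byte. */
        continuation += (size_t)((m * ones) >> 56);
    }
    for (; i < byte_len; i++)                 /* leftover tail bytes */
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            continuation++;

    return byte_len - continuation;
}
```

The result is the same as with a byte-by-byte loop; only the number of loop iterations drops. Whether that matters in practice is exactly what the quoted comment questions.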
For the reasons mentioned, optimizing UTF8CharacterLength is more useful than optimizing UTF8Length. Iterating over codepoints is needed in many algorithms, for example when searching text at word boundaries and returning the codepoint positions of the matches.
I plan to apply the version with "case" presented elsewhere.
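For readers who have not seen that version, the general shape of such a function can be sketched in C with a switch on the lead byte. This is only an illustration under my own assumptions, not the Pascal code referred to above; codepoint_byte_len and codepoint_byte_offset are made-up names, and nothing is validated.

```c
#include <stddef.h>

/* Hypothetical helper: byte length of the codepoint starting at p,
   decided from the lead byte alone. No validation is done. */
int codepoint_byte_len(const char *p)
{
    unsigned char b = (unsigned char)*p;
    switch (b >> 4) {
    case 0xC: case 0xD: return 2;   /* 110xxxxx */
    case 0xE:           return 3;   /* 1110xxxx */
    case 0xF:           return 4;   /* 11110xxx (invalid 0xF8..0xFF not rejected) */
    default:            return 1;   /* ASCII, or a stray continuation byte */
    }
}

/* Hypothetical helper: walks codepoint by codepoint and returns the byte
   offset where codepoint number cp_index (0-based) starts, or byte_len
   if the string has fewer codepoints than that. */
size_t codepoint_byte_offset(const char *s, size_t byte_len, size_t cp_index)
{
    size_t pos = 0;
    while (pos < byte_len && cp_index > 0) {
        pos += (size_t)codepoint_byte_len(s + pos);
        cp_index--;
    }
    return pos < byte_len ? pos : byte_len;
}
```

The second function shows the typical use: an algorithm steps through the text one codepoint at a time and never needs the total count.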
UTF8Length is also needed sometimes, and a faster version can be applied there, too. The optimized versions don't check the validity of the UTF-8, which is OK in most use cases, but it means the original functions must stay as they are.
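What "don't check the validity" means in practice can be shown with a small C sketch. Both functions are made up for illustration and say nothing about how the existing LazUTF8 functions treat invalid input; the checked variant is also simplified and does not reject overlong forms or surrogates.

```c
#include <stddef.h>

/* Fast count: counts non-continuation bytes, validates nothing. */
size_t fast_count(const char *s, size_t n)
{
    size_t c = 0;
    for (size_t i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            c++;
    return c;
}

/* Checked count: returns -1 when a sequence is structurally broken.
   Simplified: overlong forms and surrogates are not detected. */
long checked_count(const char *s, size_t n)
{
    long c = 0;
    size_t i = 0;
    while (i < n) {
        unsigned char b = (unsigned char)s[i];
        size_t len;
        if (b < 0x80)                len = 1;
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        else return -1;              /* stray continuation or invalid lead byte */
        if (len > n - i)
            return -1;               /* sequence truncated at the end */
        for (size_t k = 1; k < len; k++)
            if (((unsigned char)s[i + k] & 0xC0) != 0x80)
                return -1;           /* expected a continuation byte */
        i += len;
        c++;
    }
    return c;
}
```

On valid input both agree. On the broken 3-byte input 'a', 0xC3, 'b' the fast version happily answers 3, while the checked version reports an error. That extra work per codepoint is the price of validation.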