When including LazUtf8 as my first unit (see the second program below), StringOfChar does not map to UTF8StringOfChar. Why? In general, which string functions won't map and which will? Is it only the list found in LazUnicode? Why only that list?
Which way does it not map? Do you mean LazUnicode has no similar function? I can add StringOfCodePoint() there. This should work, although it is not optimized:
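A minimal sketch of what such a StringOfCodePoint could look like (not the actual LazUnicode code; the name follows the suggestion above):

```pascal
{$mode objfpc}{$H+}

// Sketch: repeat a (possibly multi-byte) UTF-8 codepoint N times.
// The codepoint is passed as a String because one codepoint
// occupies 1..4 bytes in UTF-8. Not optimized.
function StringOfCodePoint(const ACodePoint: String; N: Integer): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to N do
    Result := Result + ACodePoint;
end;

begin
  WriteLn(StringOfCodePoint('Ç', 3));  // ÇÇÇ
end.
```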
I meant that calling StringOfChar does not work the way Length or Copy or Pos work. I would like to continue to use StringOfChar and not have to switch to StringOfCodePoint as you've done below. The same goes for Ord. I get that Ord as it is implemented does not work for Unicode. Let me ask my question this way: why is it that when I add LazUtf8 to my project, it is unable to map Ord and StringOfChar and every other string-related function to a Unicode equivalent? This would be ideal. I'm certain there are very good reasons why this is not done; it's just not obvious.
Iterating through a string using an integer index does not work. This is perhaps the hardest one to deal with: we're so used to writing for i := 1 to Length(S). It should be clearly mentioned, unless of course I've done something wrong in the next program.
It does work! You are then iterating codeunits, not codepoints. In many cases the codeunit resolution is useful also with variable-length encodings.
In your case you must iterate codepoints. Using LazUnicode:
for ch in s do
Do_your_thing_with(ch);
Note that it does not work right with decomposed accent marks. For those, you must use TUnicodeCharacterEnumerator.
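To make the codeunit/codepoint distinction concrete, a small sketch (assuming the LazUTF8 unit, where UTF8Length is the codepoint-counting counterpart of Length):

```pascal
{$mode objfpc}{$H+}
uses LazUTF8;

var
  s: String;
  i: Integer;
begin
  s := 'Ça va';
  WriteLn('Codeunits:  ', Length(s));      // 6: 'Ç' takes two bytes
  WriteLn('Codepoints: ', UTF8Length(s));  // 5
  // An integer index visits bytes (codeunits), not characters:
  for i := 1 to Length(s) do
    Write(Ord(s[i]), ' ');  // 'Ç' shows up as two byte values
  WriteLn;
end.
```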
You missed the point. In Delphi, if I remember correctly, I could write for i := 1 to Length(s) do S[i] := 'Ç'; The compiler inherently understood Unicode and was able to do the right thing. I could also use while and repeat loops as well. Here, I've got to switch to a string-based iterator, and I don't think there's a transparent way to iterate in while and repeat loops with that iterator. You know all of this and you find it easy because you've been doing it for a while. For most of us this is bewildering.
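The reason indexed assignment can't work transparently in UTF-8: S[i] addresses a single byte, while 'Ç' occupies two. A hedged sketch of a codepoint-level replacement, using UTF8Delete/UTF8Insert from LazUTF8 (verify the exact signatures in your version):

```pascal
{$mode objfpc}{$H+}
uses LazUTF8;

// Sketch: replace the I-th codepoint of S (1-based codepoint index)
// with the codepoint CP, which may be 1..4 bytes long.
procedure SetCodePoint(var S: String; I: Integer; const CP: String);
begin
  UTF8Delete(S, I, 1);   // codepoint-indexed delete
  UTF8Insert(CP, S, I);  // codepoint-indexed insert
end;

var
  s: String;
  i: Integer;
begin
  s := 'abc';
  for i := 1 to UTF8Length(s) do
    SetCodePoint(s, i, 'Ç');
  WriteLn(s);  // ÇÇÇ
end.
```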
There is at least one big benefit: you must code right, because multi-byte codepoints are so common. Then as an extra bonus it supports all codepoints without exception.
Agreed. I do see that now.
Nice educational CompareStr. I have used AnsiCompareStr myself.
Integer iterator works as expected. What is the problem?
But if the Unicode space is divided into 17 planes, and a given Unicode character can conceivably belong to more than one plane, where its relative ordinal index in that plane is distinct from its absolute Unicode ordinal index, how does my CompareStr continue to work? Is it because the RTL has already set the plane to, say, UTF-8 behind the scenes, and does this then change the ordinal values of the characters? That's where I'm not clear. In fact, I looked at your implementation of CompareStr: it relies on CompareStrW, which is clearly a call into the underlying Windows OS. I'm not clear why that is so.
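On the plane question: a codepoint has exactly one ordinal value (U+0000..U+10FFFF), and the 17 planes are just consecutive ranges of that single number, so no codepoint belongs to more than one plane. A useful related property: comparing UTF-8 strings byte by byte yields the same order as comparing their codepoint ordinals, which is part of why a simple CompareStr keeps working. A sketch (UTF8CodepointToUnicode is from LazUTF8; verify its signature in your version):

```pascal
{$mode objfpc}{$H+}
uses LazUTF8;

var
  a, b: String;
  la, lb: Integer;
begin
  a := 'é';  // U+00E9
  b := 'Ω';  // U+03A9
  // Plain byte-wise < on the UTF-8 bytes...
  WriteLn(a < b);  // TRUE
  // ...agrees with the order of the codepoint ordinals:
  WriteLn(UTF8CodepointToUnicode(PChar(a), la), ' < ',
          UTF8CodepointToUnicode(PChar(b), lb));  // 233 < 937
end.
```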
Thus far, I find my experience with Unicode (UTF-8 / UTF-16) a bit unsettling. I'm not sure which functions I can use transparently via LazUnicode versus which ones I must invoke with the UTF8 prefix.
We can add more functions to LazUnicode. What is missing?
Well, am I to assume that all the routines in StrUtils work as is, or only those where the type is String and not AnsiString? Since StrUtils is part of the RTL, which is UTF-16, is there a requirement to convert from UTF-8 to UTF-16 before calling these routines?
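For routines that genuinely need UTF-16 input, the standard conversion pair in the RTL is UTF8Decode/UTF8Encode. A small sketch of the round trip:

```pascal
{$mode objfpc}{$H+}

var
  u8: String;          // UTF-8 encoded (with LazUtf8 in the project)
  u16: UnicodeString;  // UTF-16 encoded
begin
  u8  := 'Ça va';
  u16 := UTF8Decode(u8);   // UTF-8 -> UTF-16
  WriteLn(Length(u8));     // 6 codeunits (bytes)
  WriteLn(Length(u16));    // 5 codeunits (16-bit words)
  u8  := UTF8Encode(u16);  // and back
end.
```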
Also, which of the standard Pascal operators work with Unicode: =, >, <, <>? Again, since these are handled by the compiler, should their arguments be UTF-16 encoded?
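The built-in operators compare the raw bytes, so they work on UTF-8 strings without any conversion; what they do not handle is canonical equivalence. A sketch with 'é' written two different ways:

```pascal
{$mode objfpc}{$H+}

var
  composed, decomposed: String;
begin
  composed   := #$C3#$A9;     // 'é' as the single codepoint U+00E9
  decomposed := 'e'#$CC#$81;  // 'e' + combining acute accent U+0301
  // =, <>, <, > compare byte sequences; they know nothing
  // about Unicode normalization:
  WriteLn(composed = decomposed);  // FALSE
end.
```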
I'm sure to you all of these issues are obvious, but for me, as I begin to piece this puzzle back together, a lot of these questions come up. The link you provided below is extremely useful. I saw it listed on the main wiki page, but I thought it dealt with the technical details of UTF-8.
Do you mean iterating codeunits versus iterating codepoints? They are both useful. See the UTF8_strings_and_characters wiki page.
On the flip side, UTF-16 seems to be a bit more streamlined ...
You mean UCS-2 is more streamlined? UTF-16 is a variable-width encoding. UCS-2 is rather obsolete now; even Windows has supported full Unicode for almost 18 years.
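The difference shows with codepoints outside the Basic Multilingual Plane, which UTF-16 must encode as a surrogate pair (two codeunits). A sketch:

```pascal
{$mode objfpc}{$H+}

var
  w: UnicodeString;
begin
  w := #$D83D#$DE00;   // U+1F600, outside the BMP, as a surrogate pair
  WriteLn(Length(w));  // 2: one codepoint, two UTF-16 codeunits
  // Code written for UCS-2 ("one WideChar = one character") breaks here.
end.
```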
To your point, what's tricky here is that almost every character I'm familiar with (standard Latin, French, Arabic, and Syriac) is representable in the UCS-2 encoding, and for these it would work out of the box, so there's a tendency to believe it would work consistently. But then, if the RTL is UTF-16 based, can I at least assume that UTF-16 works as expected in the RTL?