so where is the misinformation in our comments about the code points?
Well, Fiji and you mixed up the terms a few times:
"... you encounter the multi code
point situation a lot sooner in utf8."
while clearly you meant "multi code
unit situation". I am still wrapping my head around these terms constantly, too. "Multi code point" refers to decomposed Unicode characters, e.g. an accented letter stored as a base letter plus a combining accent. Mixing up the terms can make the discussion very confusing.
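Since the distinction is the whole point here, a minimal sketch of it (in Python for brevity; this is not LCL/Pascal code, just an illustration of the terminology):

```python
# Terminology:
#   code point = a Unicode character number, e.g. U+00E9
#   code unit  = one storage element of an encoding
#                (a byte in UTF-8, a 16-bit word in UTF-16)

# "e with acute" as a single precomposed code point (U+00E9):
precomposed = "\u00e9"
print(len(precomposed))                  # 1 code point ...
print(len(precomposed.encode("utf-8")))  # ... but 2 UTF-8 code units (bytes)

# The same visible character decomposed as "e" plus a combining
# acute accent (U+0301). THIS is the real "multi code point" case.
decomposed = "e\u0301"
print(len(decomposed))                   # 2 code points
print(len(decomposed.encode("utf-8")))   # 3 UTF-8 code units
```

So "multi code unit" can happen to a single code point in either encoding, while "multi code point" is about decomposition and exists regardless of encoding.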
Now the discussion is drifting into useless pro/contra encoding bashing. However, both encodings are here to stay. At least I am committed to working on the UTF-16 version of the LCL in the future, once the FPC libraries are ready.
Yes! Now I have learned the details and it feels very realistic. For example, the "string" type is already needed for individual UTF-8 characters when iterating over them. The same concept works perfectly well for UTF-16 and, as an extra bonus, produces more robust code than the average UTF-16 code out there. With proper wrapper functions the exact same code can support both encodings! Besides, the LCL itself does not need to iterate over individual characters often; that logic is encapsulated in a few functions.
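The "a character is a short string" iteration idea can be sketched like this (again in Python rather than Pascal, and `utf8_char_len` / `iter_utf8_chars` are hypothetical names of mine, not LCL functions):

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, read off its lead byte."""
    if lead < 0x80:
        return 1          # plain ASCII
    if lead >= 0xF0:
        return 4          # lead byte 11110xxx
    if lead >= 0xE0:
        return 3          # lead byte 1110xxxx
    return 2              # lead byte 110xxxxx

def iter_utf8_chars(data: bytes):
    """Yield each encoded character as its own small byte string,
    mirroring the 'each character is a short string' approach."""
    i = 0
    while i < len(data):
        n = utf8_char_len(data[i])
        yield data[i:i + n]
        i += n

# Each yielded element is 1..4 bytes long:
for ch in iter_utf8_chars("aéč".encode("utf-8")):
    print(ch)
```

The same loop shape works for UTF-16 if the length function inspects surrogates instead of lead bytes, which is exactly why one wrapper layer can serve both encodings.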
No worries, be happy ...
Now the difference compared to the prolonged Unicode discussion on the FPC lists is that the decisions have already been made. Nobody needs to be converted in order to add support for a certain encoding.
There is an improved UTF-8 solution already in the LCL, and UTF-16 support is being worked on. Almost like a miracle, it seems possible to support both.
Delphi compatibility is important, and thus UTF-16 must be supported, no doubt. Delphi is gaining popularity again, and every serious Delphi developer must care about Unicode. Patito wrote nonsense about this issue.
Anyway, let's keep the technical facts and terms straight: "code unit", "code point" and all.
One misconception must be corrected because it keeps popping up: code points in UTF-16 are not fixed width, a code point may occupy one or two 16-bit code units, and they must not be treated as always occupying one. Yes, typical Delphi code does exactly that, and thus it is broken: it ignores the >35000 code points outside the Basic Multilingual Plane, which is a bug.
It also means UTF-16 has no speed advantage here. Proper code must check for surrogate pairs, which makes it more complex and slower even when it does not find any.
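The surrogate-pair check that naive UTF-16 code skips can be shown concretely (a Python sketch; `utf16_codepoint_count` is my own illustrative helper, not an LCL or Delphi routine):

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so in UTF-16
# it needs two 16-bit code units: a surrogate pair.
clef = "\U0001D11E"
utf16 = clef.encode("utf-16-le")
print(len(utf16) // 2)   # 2 code units for 1 code point

def utf16_codepoint_count(units):
    """Count code points in a sequence of 16-bit UTF-16 code units.
    The high-surrogate branch is exactly what naive code omits."""
    count = 0
    skip_next = False
    for u in units:
        if skip_next:
            skip_next = False
            continue
        count += 1
        if 0xD800 <= u <= 0xDBFF:   # high surrogate: pairs with the next unit
            skip_next = True
    return count

units = [int.from_bytes(utf16[i:i + 2], "little")
         for i in range(0, len(utf16), 2)]
print(utf16_codepoint_count(units))   # 1, while len(units) says 2
```

Code that equates "one WideChar" with "one character" reports 2 here, which is the fixed-width bug described above.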
Looking at my own code, I think it is faster with UTF-8, but that is only a "gut feeling"; I did not make exact measurements.
In general we can say that both encodings are good enough. From a technical perspective this encoding war is quite useless.
The API compatibility issue has been exaggerated, too. Conversion between encodings is quite fast and plays only a marginal role in API calls (says my gut feeling). This applies in both directions, for both Windows and Unix APIs.
What's more, most parser code continues to work with the old ASCII mindset regardless of encoding. HTML, XML, BB (bulletin board) markup, SQL etc. use tags and keywords in the ASCII range. A parser typically does not process the data between the tags.
Even code that deals with human languages may not need to iterate over characters very often. The Unicode-specific stuff is usually encapsulated in functions.
The problems have been exaggerated.