First, I admit it was stupid to put up rules against discussing Unicode. I am apparently allergic to the topic after reading the FPC lists for 5 years.
Yes, I still recommend that everybody study the topic, because it is unbelievably complex. However, talking about Lazarus and the new FPC raises some obvious new questions. Unicode is the main feature of FPC 3.0, after all.
All I want from this subject is that the standard functions work as described in the help and manuals. For example:
Copy() - Copy returns a string which is a copy of the Count characters in S, starting at position Index.
s:=Copy('южный', 2, 2);
Obviously it should take two characters, starting from the second character. If it returns '躰', that should be considered a bug.
There will be such a function for sure. Let's see if its name is also Copy() for UTF8.
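To make the byte-versus-character difference concrete, here is a small sketch. It is written in Python only because that is easy to run anywhere; it is not FPC code and says nothing about what the final function will be called. The helper names copy_bytes and copy_chars are made up for illustration. Both use the 1-based Index and Count of Copy() and assume the string is stored as UTF8 bytes.

```python
def copy_bytes(data: bytes, index: int, count: int) -> bytes:
    """Byte-oriented copy with a 1-based index, like Copy() on raw UTF8 bytes."""
    return data[index - 1:index - 1 + count]


def copy_chars(data: bytes, index: int, count: int) -> bytes:
    """Hypothetical code-point-aware copy: decode, slice, encode again."""
    text = data.decode("utf-8")
    return text[index - 1:index - 1 + count].encode("utf-8")


s = "южный".encode("utf-8")      # 5 characters, but 10 bytes in UTF8

raw = copy_bytes(s, 2, 2)        # second byte of 'ю' plus first byte of 'ж'
good = copy_chars(s, 2, 2)       # the two characters 'ж' and 'н'
```

The byte-oriented result is not valid UTF8 at all, which is why it shows up as garbage on screen, while the character-oriented result is what the help text promises.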
However, working with variable-width Unicode strings is always more complex than working with fixed-width AnsiStrings.
When you have to iterate over characters, you must use a string to hold a single character, because one character can take more than one byte.
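A rough sketch of such an iteration, again in Python just for easy testing. The names utf8_char_len and iter_utf8_chars are my own, not from FPC or LCL. It assumes well-formed UTF8 input and only looks at the first byte of each sequence to decide how many bytes belong to the character; it does not check the continuation bytes.

```python
def utf8_char_len(lead: int) -> int:
    """Length in bytes of a UTF8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII
    if lead >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a valid UTF8 lead byte: 0x%02X" % lead)


def iter_utf8_chars(data: bytes):
    """Yield each character as its own short byte string."""
    pos = 0
    while pos < len(data):
        size = utf8_char_len(data[pos])
        yield data[pos:pos + size]
        pos += size


# One character each of 1, 2, 3 and 4 bytes: 'a', 'ж', U+20AC, U+1F600.
chars = list(iter_utf8_chars("a\u0436\u20ac\U0001F600".encode("utf-8")))
sizes = [len(c) for c in chars]       # [1, 2, 3, 4]
```

Each element of chars is a small string of its own, which is exactly the "string to hold a single character" situation.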
UnicodeString in Delphi and FPC has the same problem. UTF16 is NOT fixed width, although it is easy to get that impression from code samples. Old code from the AnsiString Delphis will work most of the time, but not always. Unicode defines more than 100k characters, but 16 bits can address only 64k of them. The rest, tens of thousands of characters, require 2 words (of type UnicodeChar). Those may be rare characters, but they will eventually cause a bug in code that does not take them into account.
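For anybody who wants to see the 2-word case, this is the standard surrogate pair arithmetic as a Python sketch. The function name to_utf16_units is made up, and it assumes the input is a valid code point that is not itself in the surrogate range.

```python
def to_utf16_units(code_point: int) -> list:
    """Return the UTF16 code units (16-bit words) for one code point."""
    if code_point < 0x10000:
        return [code_point]               # fits in a single word
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]


bmp = to_utf16_units(0x0436)              # 'ж' fits in one word
astral = to_utf16_units(0x1F600)          # an emoji beyond the first 64k
```

Here bmp has one element and astral has two (0xD83D, 0xDE00). Code that treats s[i] as "the i-th character" silently splits the second kind in half.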
Which encoding should Lazarus use? There seemed to be 2 alternatives: the Delphi-compatible UnicodeString, or AnsiString plus the UTF8-specific functions in LCL.
Then it turned out that FPC + its libs can use UTF8 by simply setting some variables. FPC is well designed; 5 years of arguing on the mailing list was not wasted after all.
Lazarus + LCL is already designed for UTF8, and this allows a conversion with the fewest changes. (This assumes nothing unexpected comes up in tests.)
It will still be possible to create a version of Lazarus + LCL with UTF16 UnicodeString if somebody wants to implement it.
UTF8 is a very clever encoding. It is backwards compatible with ASCII, it produces compact data for western languages (ok, I don't know what Chinese people think of it), and its integrity can be checked from the data itself.
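The three properties are easy to check. A small Python sketch follows; the sample strings and the helper name is_valid_utf8 are just my picks for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Integrity check that needs nothing but the bytes themselves."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_text = "Lazarus"
western = "Gr\u00fc\u00dfe aus K\u00f6ln"   # German text, 14 characters, 3 non-ASCII
chinese = "中文字符"                          # 4 characters

# 1. Pure ASCII text is byte-for-byte identical in UTF8.
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. Size in bytes as (UTF8, UTF16).
western_sizes = (len(western.encode("utf-8")), len(western.encode("utf-16-le")))
chinese_sizes = (len(chinese.encode("utf-8")), len(chinese.encode("utf-16-le")))

# 3. Cutting a multi-byte character in half is detectable from the data.
truncated_ok = is_valid_utf8(chinese.encode("utf-8")[:-1])
```

For these samples the western text takes 17 bytes in UTF8 against 28 in UTF16, while the Chinese text takes 12 against 8. So for CJK text UTF8 is the larger one, and the truncated data is reported as invalid.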
The original benefit of UTF16 was its fixed-width characters, but that is no longer true, so its main benefit went away.
The good news is that user code only seldom needs to iterate over single variable-width Unicode characters, because such things are encapsulated in libraries, and because the characters of interest are often in the ASCII range. For example, many current parsers work well with UTF8 data because all <tag> chars are ASCII. Data between tags can be UTF8, but typically the parser just copies it without analysis.
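A toy version of such a parser, as a Python sketch. The function split_tags is invented for this post and is far from a real XML/HTML parser. It works purely on bytes and only ever looks for the ASCII bytes '<' and '>'. That is safe because every byte of a multi-byte UTF8 character is 0x80 or above, so it can never be mistaken for an ASCII char.

```python
def split_tags(data: bytes):
    """Split UTF8 bytes into ('tag', ...) and ('text', ...) pieces."""
    pieces = []
    pos = 0
    while pos < len(data):
        if data[pos:pos + 1] == b"<":
            end = data.index(b">", pos) + 1         # ValueError if the tag is unclosed
            pieces.append(("tag", data[pos:end]))
        else:
            end = data.find(b"<", pos)
            if end == -1:
                end = len(data)
            pieces.append(("text", data[pos:end]))  # copied without analysis
        pos = end
    return pieces


doc = "<b>южный ветер</b>".encode("utf-8")
pieces = split_tags(doc)
```

The Cyrillic text between the tags comes out intact even though the code never decodes a single character.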
So now I ended up discussing the differences between UTF8 and UTF16 after all. Here we go...
We can write about concrete Lazarus implementation details on the mailing list once somebody has done tests.