Look, it's slower. Provide a benchmark that shows it's faster. Until then my facts stand.
Benchmarks would be interesting but they should present valid bug-free code, not the sloppy and buggy kind that you are proposing.
Let's repeat the facts. Unicode currently defines over 110,000 codepoints.
According to a BabelStone page
http://babelstone.blogspot.fi/2005/11/how-many-unicode-characters-are-there.html
there are 120737 characters, but I guess that figure includes multi-codepoint accented characters, so the number of codepoints is a little lower. (Where can I find the exact number of codepoints?)
The graphs there show how the count has kept growing (up to 2014).
One 16-bit word in UTF-16 can directly hold 2^16 = 65536 unique values. That means roughly 50,000 of the currently defined codepoints don't fit in a single 16-bit word and must be encoded as surrogate pairs (32 bits).
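To illustrate the arithmetic, here is a minimal Free Pascal sketch; U+1F600 is just an arbitrary example of a codepoint above U+FFFF:

program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  CodePoint, Rest: Cardinal;
  HighSurrogate, LowSurrogate: Word;
begin
  CodePoint := $1F600;                        // an example codepoint above U+FFFF
  Rest := CodePoint - $10000;                 // leaves a 20-bit value
  HighSurrogate := $D800 or (Rest shr 10);    // top 10 bits -> high surrogate
  LowSurrogate  := $DC00 or (Rest and $3FF);  // low 10 bits -> low surrogate
  WriteLn('U+', IntToHex(CodePoint, 5), ' -> ',
          IntToHex(HighSurrogate, 4), ' ', IntToHex(LowSurrogate, 4));
  // prints: U+1F600 -> D83D DE00
end.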
Code that ignores those codepoints is broken. Sure, those are rare codepoints and the bugs will pop up only sometimes. It is still not acceptable. If you have a mathematical algorithm that gives wrong results "sometimes", it is considered broken and must be fixed. Why should code for character encodings be different?
Fiji, you keep repeating this false information, basically claiming that UTF-16 is fixed width.
http://forum.lazarus.freepascal.org/index.php/topic,28660.msg179684.html#msg179684
Unfortunately you are not the only one doing so. Let's see what has caused this misconception ...
Delphi switched to UTF-16 strings in 2009. It was a big change. Obviously customers asked troublesome questions like "How compatible is it?" and "How much conversion work must we do?"
The marketing team, being creative, decided to say "Yeah, yeah, it is compatible. No worries!" instead of explaining technical details about surrogate pairs or multi-codepoint characters.
They were mostly worried about their sales.
Part of the same marketing tactic was to name the new string type UnicodeString. For the sake of symmetry it should have been UTF16String, because there is also UTF8String. It is apparently so confusing that "stocki" still, in late 2015, believes that Unicode = UTF-16, despite all the information available on the net.
UTF-16 surrogate pairs don't have the same inherent properties that UTF-8 multi-byte sequences have. If you use the fast Pos() etc. functions, UTF-16 can go wrong sometimes, while with UTF-8 they always give the right answer. Thus UTF-8 is faster in real-world applications when used cleverly.
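A small sketch of the property I mean (the byte values come from the UTF-8 definition; the program itself is only an illustration):

program Utf8PosDemo;
{$mode objfpc}{$H+}
var
  S, Needle: RawByteString;
  BytePos: SizeInt;
begin
  // U+00E4 (a-umlaut) is encoded in UTF-8 as the two bytes $C3 $A4.
  // The strings are built from explicit byte values to avoid source-encoding issues.
  S := 'x' + #$C3#$A4 + 'y' + #$C3#$A4;   // four characters, six bytes
  Needle := #$C3#$A4;
  BytePos := Pos(Needle, S);
  // Pos() compares plain bytes, yet a match can only start on a codepoint
  // boundary: UTF-8 lead bytes ($C2..$F4) and continuation bytes ($80..$BF)
  // occupy disjoint ranges, so a multi-byte needle never matches mid-character.
  WriteLn('first match at byte index ', BytePos);   // prints 2
end.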
UTF-16 was invented a long time ago, when there were fewer than 65536 codepoints and it truly was fixed width. That one big benefit was lost when the number of defined codepoints grew. UTF-16 also has other issues, like the CPU endianness dependency.
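For example, the same 16-bit code unit ends up in different byte order depending on the CPU (a minimal sketch, nothing Delphi-specific assumed):

program EndianDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  CodeUnit: Word;
  Bytes: array[0..1] of Byte;
begin
  CodeUnit := $0041;                        // 'A' as a single UTF-16 code unit
  Move(CodeUnit, Bytes, SizeOf(CodeUnit));  // copy its raw in-memory bytes
  // A little-endian CPU (x86) prints "41 00" (UTF-16LE order),
  // a big-endian CPU prints "00 41" (UTF-16BE order) - hence the BOM.
  WriteLn(IntToHex(Bytes[0], 2), ' ', IntToHex(Bytes[1], 2));
end.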
Yet, I am not against it. A Delphi-compatible Unicode system must be implemented. Backwards compatibility must always deal with technical decisions made in the past. No problem.
Unicode is complex. We just have to accept it. Now the discussion is only about codepoints, but even if you can find codepoints with 100% accuracy, you still don't know whether a codepoint is part of a multi-codepoint accented character.
No encoding can solve that, not even UTF-32, because such characters are defined at the Unicode character level, not at the encoding level.
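A small sketch of that codepoint-vs-character difference (the codepoint values are standard Unicode; the program is only an illustration):

program CombiningDemo;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: UnicodeString;
begin
  // "e with acute" as one precomposed codepoint U+00E9
  Precomposed := WideChar($00E9);
  // the same visible character as base letter U+0065 plus combining acute U+0301
  Decomposed := WideChar($0065);
  Decomposed := Decomposed + WideChar($0301);
  WriteLn(Length(Precomposed));  // 1 code unit, 1 codepoint
  WriteLn(Length(Decomposed));   // 2 code units, 2 codepoints, still one visible character
  // Counting codepoints correctly in UTF-8, UTF-16 or UTF-32 still does not tell
  // you that these two strings render as the same accented character; that
  // knowledge lives at the Unicode character level (normalization).
end.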