taazz, you should really learn even the basics of Unicode before writing instructions.
1) encoding is the way the characters are represented in a string.
No, only codepoints are encoded. A "character" can mean many things. Details below.
2) code point is a type/variable etc with the smallest possible length a variable length encoding can have so a code point in utf8 is 1 byte long in utf16 is 2 bytes long. So a code point on utf8 is a byte and in utf16 is a word.
No! What you just described is a codeunit. It is the smallest "atom" of an encoding: 8 bits in UTF-8, 16 bits in UTF-16.
3) character it has a minimum size of 1 code point and a maximum based on the encoding. In the case of a utf8 character it can have a size of 1 up to 6 code points (if memory serves me right), in the case of a utf16 character it has a size of 1 up to 2 code points.
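Now you have described a codepoint. In a variable-length encoding a codepoint consists of one or more codeunits: 1 to 4 in UTF-8 (not 6; the longer forms were removed from the standard long ago) and 1 or 2 in UTF-16. A "character" is a fuzzy term and can mean many things.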
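To make the codeunit/codepoint distinction concrete, here is a minimal sketch. It is in Python only because that is short and runs anywhere; the helper name codeunit_counts is made up for this illustration and is not part of any library. It counts how many codeunits a single codepoint needs in each encoding.

```python
def codeunit_counts(ch):
    """Return (UTF-8 codeunits, UTF-16 codeunits) for a single codepoint."""
    utf8_units = len(ch.encode("utf-8"))            # one codeunit = 1 byte
    utf16_units = len(ch.encode("utf-16-le")) // 2  # one codeunit = 2 bytes
    return utf8_units, utf16_units


# One codepoint each: ASCII, Latin-1 range, rest of the BMP, outside the BMP.
for ch in ("A", "\u00e4", "\u20ac", "\U0001F377"):
    u8, u16 = codeunit_counts(ch)
    print(f"U+{ord(ch):04X}: {u8} UTF-8 codeunit(s), {u16} UTF-16 codeunit(s)")
```

The last codepoint is the one that matters for this argument: it needs more than one codeunit in both encodings.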
4) length of a unicode string I will only use length to refer to the size of a string in code points so a length of 10 for a utf16 string can have from 5 to 10 characters with a memory size of 20 bytes, on a utf8 string it can have from 2 to 10 characters with a memory size of 10 bytes.
No. That length counts codeunits, not codepoints.
Since a code point in utf16 is 2 bytes long this is to be expected, any random access to a string, accesses code points and not characters.
No, it accesses codeunits.
The only reliable way to access characters in any variable length encoding is to use a sequential access that would make most processing a bit slow though. I think that converting a random code point to a character is way easier in utf16 than it is in utf8.
Codepoints require sequential access, too. It is no easier in UTF-16 than in UTF-8 because both are variable-width encodings. It would be easier in UCS-2, but UCS-2 is obsolete now. More than half of the assigned codepoints are already outside the BMP, and the number grows as Unicode is extended. Even MS Windows has supported full Unicode for almost 18 years now.
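A rough sketch of why this is so, again in Python and with a made-up function name (nth_codepoint_utf8), so take it as an illustration and not as how any library does it. Indexing the encoded data gives a codeunit; finding the n-th codepoint means walking from the start and reading each lead byte to learn how many codeunits to skip.

```python
def nth_codepoint_utf8(data, n):
    """Return the n-th codepoint (0-based) of UTF-8 bytes by scanning from the start."""
    index = 0  # codepoints passed so far
    pos = 0    # current codeunit (byte) offset
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells how many codeunits this codepoint occupies.
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        elif lead >> 3 == 0b11110:
            size = 4
        else:
            raise ValueError(f"invalid lead byte at offset {pos}")
        if index == n:
            return data[pos:pos + size].decode("utf-8")
        index += 1
        pos += size
    raise IndexError("codepoint index out of range")


text = "a\u00e4\u20ac\U0001F377z"   # 5 codepoints
data = text.encode("utf-8")

print(len(data))                    # length in codeunits, not codepoints
print(hex(data[3]))                 # random access: a single codeunit
print(f"U+{ord(nth_codepoint_utf8(data, 3)):04X}")  # sequential scan: a codepoint
```

UTF-16 needs the same kind of scan; the only difference is that the test is "is this codeunit a high surrogate" instead of looking at a lead byte.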
I have no idea, LCL 1.6.0 made some pretty aggressive changes on the unicodestring data type that started a conversion of the affected application to C# at work so I never had the chance to look closely at the underlying code. You might not be converting at all. In any case I'll take a closer look on your test case, probably at the weekend.
Again totally false information. How is this possible?
LazUtils in Lazarus 1.6.0 made aggressive changes to AnsiString. UnicodeString is not affected.
I have improved the wiki page that explains it. Please take a look:
http://wiki.freepascal.org/Unicode_Support_in_Lazarus
The solution turned out to be amazingly compatible with Delphi at source level when a few simple rules are followed.
The LazUtils package also has the unit LazUnicode, which allows writing encoding-agnostic code. Such code works 100% in Delphi and in Lazarus, using both UTF-16 and UTF-8 encodings. Please take a look.
---
This is copied from my post on the Lazarus mailing list.
The word "character" can mean the following things when people communicate about encodings and Unicode:
1. CodeUnit — Represented by Pascal type "Char".
2. CodePoint — all the arguments about one encoding's supremacy over
another deal with CodePoints. Yes, UTF-8, UTF-16, UTF-32 etc. all only
encode CodePoints.
3. Abstract Unicode character — like 'WINE GLASS'.
(There should have been the actual wineglass symbol but this forum SW does not support Unicode and I had to remove it.)
4. Coded Unicode character — "U" + a unique number, like U+1F377. This
is what "character" means in Unicode Standard.
5. User-perceived character — Whatever the end user thinks of as a character.
This is language dependent. For instance, ‘ch’ is two letters in
English but one letter in Czech and Slovak.
Many more complexities are involved here, including decomposed codepoints.
6. Grapheme cluster
7. Glyph — related to fonts.
So, number 4. is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers
and 5. "User-perceived character" for everybody else.
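The levels in the list can be seen side by side in a small sketch. It is in Python with the standard unicodedata module; the function name describe and the dictionary keys are invented for this example. It reports codeunits (meaning 1), coded characters (meaning 4) and abstract character names (meaning 3) for a string that an end user would see as a single character (meaning 5).

```python
import unicodedata


def describe(text):
    """Summarise one string at several of the levels listed above."""
    return {
        "utf8_codeunits": len(text.encode("utf-8")),
        "utf16_codeunits": len(text.encode("utf-16-le")) // 2,
        "codepoints": [f"U+{ord(c):04X}" for c in text],
        "names": [unicodedata.name(c) for c in text],
    }


# The wine glass from the list: one codepoint, several codeunits.
wine = describe("\U0001F377")

# A decomposed e with acute accent: the user sees one character,
# but it is two codepoints (letter + combining accent).
e_acute = describe("e\u0301")

for label, info in (("wine glass", wine), ("decomposed e-acute", e_acute)):
    print(label, info)
```

The second string is the reason meaning 5 is harder than meaning 2: counting codepoints still does not tell you what the user perceives.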