taazz, you should really learn even the basics of Unicode before writing instructions.
I thought I did. No? Well, let's learn something then.
1) encoding is the way the characters are represented in a string.
No, only codepoints are encoded. A "character" can mean many things. Details below.
I disagree; a character is a very specific thing, it can't mean a lot of things. The letter A is a character, a Chinese ideogram is a character; the tab key (ASCII 09) is not a character, it is a control "character", and a character only by association, not by functionality. That is, it happens to be part of the character set, so we call it a character for simplicity; it never really was one.
2) Code point is a type/variable etc. with the smallest possible length a variable-length encoding can have, so a code point in UTF-8 is 1 byte long and in UTF-16 is 2 bytes long. In other words, a code point in UTF-8 is a byte and in UTF-16 a word.
No! You just explained a codeunit. It is the smallest "atom" in Unicode.
No, a code unit is a word or a byte or a dword. A code point has the size and type of the code unit but the value of the character in the table. But hey, let's go with your definition ("your" as in all of you, not you specifically).
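Aside: the codeunit/codepoint distinction the two posters are circling is easy to demonstrate. A minimal sketch in Python (chosen only for brevity; the same relationships hold for FPC's string types):

```python
# One codepoint, U+20AC EURO SIGN, in three Unicode encoding forms.
# The codepoint is the same; only the number and width of code units differ.
s = "\u20ac"

utf8  = s.encode("utf-8")      # code units are bytes
utf16 = s.encode("utf-16-be")  # code units are 16-bit words
utf32 = s.encode("utf-32-be")  # code units are 32-bit words

print(len(utf8))        # 3 code units in UTF-8
print(len(utf16) // 2)  # 1 code unit in UTF-16
print(len(utf32) // 4)  # 1 code unit in UTF-32
```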
3) Character: it has a minimum size of 1 code point and a maximum based on the encoding. In the case of UTF-8 a character can have a size of 1 up to 6 code points (if memory serves me right); in the case of UTF-16, 1 up to 2 code points.
Now you explained a codepoint. In a variable length encoding a codepoint consists of one or more codeunits. A "character" is a fuzzy term and can mean many things.
No, sorry. I define the size of a character; a code point does not have variable length.
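Aside: whatever term one prefers, the variable length is easy to observe with a codepoint outside the BMP. A sketch in Python:

```python
# U+1F377 WINE GLASS lies outside the Basic Multilingual Plane,
# so it needs more than one code unit in both UTF-8 and UTF-16.
glass = "\U0001F377"

print(len(glass.encode("utf-8")))           # 4 bytes (UTF-8 code units)
print(len(glass.encode("utf-16-be")) // 2)  # 2 words (a UTF-16 surrogate pair)
print(len(glass))                           # still a single codepoint
```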
4) Length of a Unicode string: I will only use "length" to refer to the size of a string in code points, so a length of 10 for a UTF-16 string can mean from 5 to 10 characters with a memory size of 20 bytes; a UTF-8 string of length 10 can have from 2 to 10 characters with a memory size of 10 bytes.
No.
Since a code point in UTF-16 is 2 bytes long this is to be expected; any random access to a string accesses code points, not characters.
No, it accesses codeunits.
if you say so.
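Aside: the point about random access is demonstrable. Indexing into an encoded buffer at an arbitrary position yields code units, which may be fragments of a codepoint. A sketch in Python:

```python
# Indexing a UTF-16 buffer at a random position yields code units,
# which may be halves of a surrogate pair, not whole codepoints.
s = "a\U0001F377"                  # 'a' followed by U+1F377 WINE GLASS
buf = s.encode("utf-16-le")        # 'a' occupies bytes 0-1, the glass 2-5

unit = int.from_bytes(buf[2:4], "little")
print(hex(unit))                   # 0xd83c: a high surrogate, not a character
```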
The only reliable way to access characters in any variable-length encoding is sequential access, though that would make most processing a bit slow. I think that converting a random code point to a character is way easier in UTF-16 than it is in UTF-8.
Also codepoints require sequential access. It is not any easier in UTF-16 than it is in UTF-8 because they are both variable-width encodings. For UCS-2 it would be easier, but UCS-2 is obsolete now. More than half of codepoints are already outside the BMP, and the number grows as Unicode is extended. Even MS Windows has supported full Unicode for almost 18 years now.
I disagree; it is far easier to determine whether a code point is the start, the end, or none of the above in UTF-16. It is far more convoluted in UTF-8, although from a logic point of view UTF-8 only repeats a couple of steps a few more times.
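Aside: the two classification rules being argued about can be put side by side. A sketch in Python of how one might classify a single code unit in each encoding (the function names are illustrative, not from any library):

```python
def utf8_class(b: int) -> str:
    """Classify one UTF-8 code unit (a byte)."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a whole ASCII codepoint
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: middle/end of a sequence
    return "lead"              # 11xxxxxx: starts a multi-byte sequence

def utf16_class(u: int) -> str:
    """Classify one UTF-16 code unit (a 16-bit word)."""
    if 0xD800 <= u <= 0xDBFF:
        return "high surrogate"  # starts a pair
    if 0xDC00 <= u <= 0xDFFF:
        return "low surrogate"   # ends a pair
    return "single"              # a whole BMP codepoint

# U+20AC EURO SIGN in UTF-8 is the bytes E2 82 AC:
print([utf8_class(b) for b in "\u20ac".encode("utf-8")])
# ['lead', 'continuation', 'continuation']
```

So both encodings need the same kind of test; UTF-8 just has more unit classes to distinguish.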
I have no idea; LCL 1.6.0 made some pretty aggressive changes to the UnicodeString data type that started a conversion of the affected application to C# at work, so I never had the chance to look closely at the underlying code. You might not be converting at all. In any case I'll take a closer look at your test case, probably at the weekend.
Again totally false information. How is this possible?
LazUtils in Lazarus 1.6.0 made aggressive changes on AnsiString. UnicodeString is not affected.
I have no idea how or why; as already mentioned, I never looked at the problem closely enough and I'm not inclined to look now either. I have enough trouble finding time to work on Pascal as it is; I'd rather spend it on creating instead of correcting.
I have improved the wiki page that explains it. Please take a look:
http://wiki.freepascal.org/Unicode_Support_in_Lazarus
The solution turned out to be amazingly compatible with Delphi at source level when a few simple rules are followed.
LazUtils package also has unit LazUnicode which allows writing encoding agnostic code. Such code works 100% in Delphi and in Lazarus, using both UTF-16 and UTF-8 encodings. Please take a look.
Thanks. I'll take a close look when I write the SQL editor for TurboBird; ATSynEdit sounds like a proper fit, and your unit will be a godsend to extend the support if the need arises.
---
This is copied from my post in Lazarus mailing list.
The word "character" can mean the following things when people communicate about encodings and Unicode:
1. CodeUnit — Represented by Pascal type "Char".
2. CodePoint — all the arguments about one encoding's supremacy over
another deal with CodePoints. Yes, UTF-8, UTF-16, UTF-32 etc. all only
encode CodePoints.
Sorry, I see no real difference between a code point and a code unit. For me they are equivalent.
3. Abstract Unicode character — like 'WINE GLASS'.
(There should have been the actual wineglass symbol but this forum SW does not support Unicode and I had to remove it.)
That is a character I agree.
4. Coded Unicode character — "U" + a unique number, like U+1F377. This
is what "character" means in Unicode Standard.
This should not be in this list at all; it is only an input/definition method for a character, and it is only relevant for parsers, the same way HTML encodes %charcode%, or the same way the two characters 1 and 5 represent the number fifteen in code.
5. User-perceived character — Whatever the end user thinks of as a character.
This is language dependent. For instance, ‘ch’ is two letters in
English but one letter in Czech and Slovak.
Many more complexities are involved here, including decomposed codepoints.
Those are two characters which are read as a single letter in Czech and Slovak. Are those characters also used on their own? Do they occupy the same space as a single character or as two? (Visually, that is; I'm mostly curious, it does not make any real difference.)
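Aside: the decomposed codepoints mentioned above mean that even "one codepoint = one character" fails. A sketch in Python using the standard unicodedata module:

```python
import unicodedata

# 'e with acute' as one precomposed codepoint vs. base letter + combining mark.
nfc = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
nfd = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT

print(len(nfc), len(nfd))                        # 1 2: different codepoint counts
print(unicodedata.normalize("NFC", nfd) == nfc)  # True: same user-perceived character
```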
6. Grapheme cluster
OK, this is unknown to me. Are you talking about the same thing that engkin posted a couple of posts back about compound letters?
7. Glyph — related to fonts.
Erm, are you talking about the visual representation of the character here? E.g. Gothic letters, or Times, Roman, etc.? If yes, those are not part of the encoding; let's not make things more complicated, for now at least.
So, number 4. is the official Unicode "character".
Otherwise the most useful meanings are 1. "CodeUnit" for programmers
and 5. "User-perceived character" for everybody else.
Yes and no. The character is definitely number 5; that is the target. The goal of the encoding is to define the ID of each visual character and what each font is expected to show for that ID, with some leeway: e.g. a capital U should be recognizable as the letter capital U, and a wine glass should be recognizable as a wine glass. You can use any wine glass you can think of in your font, but you should not use a beer mug.
Everything else is the encoding of that information.
There seems to be a bit of confusion about what is a character and what is a letter, which is understandable; after all, characters started their lives as representations of letters.
At this point I would really like to ask that we keep the number of definitions as low as possible, but I have a feeling that I'm alone in this, so I'll stick to my guns for now. I really hope you'll manage to change my mind (it would mean I learned something new, and that is always fun).