Just to make sure we are talking about the same thing here: when I say multipoint I mean multiple bytes in utf8 (since a code point is 1 byte long) and multiple words in utf16.
Ok, you mean code-unit (see table at
http://en.wikipedia.org/wiki/Code_unit#Code_unit )
code unit: utf8 = 8 bit / utf16 = 16 bit
code point: utf8 = 1..4 code units / utf16 = 1 code unit [[[EDIT: as corrected later, utf16 = 1 or 2 code units]]]
char: utf8 and utf16 = 1 or more code points (a single code point, or a base plus combining marks; in utf16 one code point may itself need a surrogate pair of code units)
No idea if a combining mark can be added to a surrogate pair (I see no reason why not).
You can have many combining marks added to one code point (and sometimes the order matters, sometimes not).
glyph: can be 1 or more chars (and maybe one char can consist of multiple glyphs? not sure)
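The terminology above can be checked directly in Python (used here purely for illustration, the thread itself is about Pascal/Lazarus): one emoji code point takes 4 UTF-8 code units but 2 UTF-16 code units (a surrogate pair), and a combining mark can indeed follow a base char that needs a surrogate pair.

```python
s = "\U0001F600"  # one code point outside the BMP (an emoji)

print(len(s))                           # 1 code point
print(len(s.encode("utf-8")))           # 4 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)

# A combining mark after a surrogate-pair char is legal Unicode:
t = "\U0001D400\u0301"  # MATHEMATICAL BOLD CAPITAL A + combining acute accent
print(len(t))                           # 2 code points
print(len(t.encode("utf-16-le")) // 2)  # 3 UTF-16 code units
```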
I have no personal experience with de-/pre-composed chars; I'll simply have to rely on developers who face the problem to tell me which ranges those chars are in.
In my current understanding, though, the only thing that changes is the interpretation: you see the two code points next to each other and translate them as the char represented by the pre-composed one, i.e. a locale-specific interpretation problem. In some other locale they might be seen as some other char, and if there is no way for them to be seen as separate characters, then the standard failed at the simplest of things: keeping itself inside its own boundaries.
de-composed and pre-composed are identical chars, they only have a different representation.
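That "identical chars, different representation" point is exactly what Unicode normalization addresses. A small Python sketch (illustration only):

```python
import unicodedata

precomposed = "\u00E9"   # e-acute as one code point
decomposed  = "e\u0301"  # "e" followed by a combining acute accent

# As raw code-point sequences they differ:
print(precomposed == decomposed)  # False

# After normalizing both to a common form they compare equal:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```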
As to the need of some code to recognize them, that depends on what the code does.
If you have the 2 codepoints "e" and <accent grave>, then:
- code counting chars should count the 2 codepoints as one char
- code inserting newlines (hard wrap every 80 chars) must not put a newline between the 2 codepoints
- code searching a sub-string should normalize both strings first
- code performing a binary match doesn't need to care (but then this can use bytes anyway)
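The first and third items in the list above can be sketched in Python (illustration only; note that NFC composition works for this example, but some base+combining pairs have no precomposed form, so a real char counter needs proper grapheme segmentation, not just normalization):

```python
import unicodedata

s = "caf" + "e\u0301"  # 5 code points, but 4 user-perceived chars

# counting chars: here NFC collapses "e" + accent into one code point
print(len(s))                                # 5 code points
print(len(unicodedata.normalize("NFC", s)))  # 4 after composition

# sub-string search: normalize both strings first
needle = "caf\u00E9"  # precomposed e-acute
print(needle in s)    # False without normalization
print(unicodedata.normalize("NFC", needle) in unicodedata.normalize("NFC", s))  # True
```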
Yes, I was referring to the Windows API; it is, after all, the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. Qt is utf16 even on Linux.
I do try to see what utf8 has to offer that utf16 does not (the other way around is a bit more obvious), and I can't. Even ASCII compatibility is not a requirement for me.
Well common api calls, I can think of are:
- painting: the actual painting probably takes way longer than the conversion. But that's just my guess...
- file system: conversion vs processing time depends on the media used?
Anyway, I didn't say utf8 was better. I only tried to point out that the speed difference is (by far) not as big a matter as sometimes seems to be implied. (And see my earlier post, utf8 can be faster in special cases.)
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Processing_issues but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character
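A quick Python check of that last point: even in UTF-32, where every code point is exactly 4 bytes, one displayed character can still span several code points.

```python
s = "e\u0301"  # displayed as one character (e with acute accent)
utf32 = s.encode("utf-32-le")
print(len(utf32) // 4)  # 2 code points (8 bytes), yet 1 displayed character
```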
For what's better: Search the internet. There are thousands of articles.
My conclusion: Neither is better in itself. It depends on what you want/need to do.
As for utf8 2 special cases that come to mind:
1) *English* text in utf8 saved to a file can be opened by non-UTF editors. Of course that is English only.
2) Lazarus as a Pascal IDE. Since Pascal source code (unless it has lots of comments or inlined strings in other languages) uses mainly Latin chars, utf8 saves memory used by the IDE.
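The memory argument in (2) is easy to demonstrate: for ASCII-only text, utf8 needs exactly half the bytes of utf16. (Python used for illustration; the Pascal line below is an arbitrary made-up example, not from any real project.)

```python
source = "procedure TForm1.Button1Click(Sender: TObject);"  # ASCII-only Pascal line
utf8_bytes  = len(source.encode("utf-8"))
utf16_bytes = len(source.encode("utf-16-le"))
print(utf8_bytes, utf16_bytes)  # utf16 takes twice the memory here
```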