Author Topic: Need help understanding the effects of Unicode  (Read 16220 times)

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Need help understanding the effects of Unicode
« Reply #15 on: June 08, 2015, 11:55:09 am »
the difference is that in utf16 the code points and characters are the same thing in 99% of languages
Quote
other than the fact that you encounter the multi code point situation a lot sooner in utf8.
Minor correction: afaik that is true equally for utf8 and utf16. They have the same codepoints. The difference is that in utf8 a codepoint has a variable number of bytes.
So the likelihood of multi-codepoint is the same for both.
Just to make sure we are talking about the same thing here: when I say multipoint I mean multiple bytes in utf8 (since a code point is 1 byte long) and multiple words for utf16.

Also the existence of pre-composed chars (e.g. for accented) does not guarantee the use of those. Unless you know the source of your text, you may well encounter decomposed chars (2 codepoints) in many European languages (incl French, German, and others)


I have no personal experience with de/pre-composed chars, so I'll simply have to rely on developers who face the problem to inform me about the range those chars are in.
In my current understanding, though, the only thing that changes is the translation, i.e. you see them one next to the other and translate them as the char represented by the pre-composed one, AKA a locale-specific interpretation problem. In some other locale it might be seen as some other char, and if there is no way for them to be seen as separate characters then the standard failed at the simplest of things: keeping itself inside its boundaries.


Quote
I don't see anything that should allow us to choose utf8 or utf16 in your examples
I didn't make an argument for utf8; I simply corrected a point against it. The speed argument is in many cases exaggerated. That said, there may be cases where it applies. There are also cases where utf8 is faster (because pure English text needs less memory in utf8, which may reduce cache misses).

Quote
you are api native as a plus
Depends on your OS. Afaik win API is utf16. Linux is utf8. But I would have to double check that.
Yes, I was referring to the Windows API; it is after all the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. QT is utf16 even on Linux.
I do try to see what UTF8 has to offer that utf16 does not (the other way around is a bit more obvious), and I can't. Even ASCII compatibility is not a requirement for me.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #16 on: June 08, 2015, 11:59:35 am »
About the wine glass. This is a surrogate pair. So technically it is 2 codepoints (even though afaik neither of them can stand alone in valid utf16)

Martin, you have more knowledge about Unicode than I have in general, but here you are wrong. A surrogate pair is only one codepoint. It is a concept used only with UTF-16 and is the equivalent of multi-byte codepoints in UTF-8.
As you remember, I was completely confused by this term and used it in the wrong way. After that I had to study what it means.

Quote
Actually in UTF16 codepoints do have a fixed width (1 word = 2 bytes). But characters are still of variable length. And that can (as in: optionally) apply to much more common examples such as accented chars, umlauts, and others.

I think I understand what you mean, but this is very confusing for everybody, especially because many people don't even understand the encodings.
So please let's not bring the decomposed accented characters into the discussion now. This thread is already going in a bad direction.
« Last Edit: June 08, 2015, 12:11:48 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #17 on: June 08, 2015, 12:05:46 pm »
Yes, I was referring to the Windows API; it is after all the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. QT is utf16 even on Linux.

All Unix-related systems, including Linux, OSX and Android, use UTF-8 natively. That is the vast majority of systems in use already today, and the difference is growing.
QT made an unfortunate choice back then.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12641
  • FPC developer.
Re: Need help understanding the effects of Unicode
« Reply #18 on: June 08, 2015, 12:07:04 pm »
Depends on your OS. Afaik win API is utf16. Linux is utf8. But I would have to double check that.

The Linux kernel is 1-byte based and agnostic about contents.

POSIX userland is a mix of utf-8 and one other type (wchar_t), which can be utf-16 or utf-32. Since C strings are all manually handled anyway, that matters less for them.

QT is utf16.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12641
  • FPC developer.
Re: Need help understanding the effects of Unicode
« Reply #19 on: June 08, 2015, 12:08:19 pm »
Yes, I was referring to the Windows API; it is after all the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. QT is utf16 even on Linux.

All Unix-related systems, including Linux, OSX and Android, use UTF-8 natively.

Cocoa is also 16-bit. The majority of GUI systems seem to be 16-bit.

I'm not sure, but IIRC gettext and freetype use a lot of wchar_t internally too; in other words, they only present a utf-8 interface but are wchar_t internally.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Need help understanding the effects of Unicode
« Reply #20 on: June 08, 2015, 12:19:37 pm »
Yes, I was referring to the Windows API; it is after all the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. QT is utf16 even on Linux.

All Unix-related systems, including Linux, OSX and Android, use UTF-8 natively. That is the vast majority of systems in use already today, and the difference is growing.
QT made an unfortunate choice back then.
OR unix made the unfortunate choice. The more I read, the more I see that the case for utf8 is an emotional one, not a logical one. In any case I'm out of this thread; I'll start a new one if I have something new to say.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #21 on: June 08, 2015, 12:46:43 pm »
Cocoa is also 16-bit. The majority of GUI systems seem to be 16-bit.
I'm not sure, but IIRC gettext and freetype use a lot of wchar_t internally too; in other words, they only present a utf-8 interface but are wchar_t internally.

Marco, you know the details better here.
Now I use this opportunity to mention the future encoding of FPC + LCL.
We already have the "better" UTF-8 support using "SetMultiByteConversionCodePage(CP_UTF8)" in new FPC etc.
The wiki page also kind of promises a UTF-16 version of LCL, once RTL and other FPC libs support it.

I know there is a confrontation between those 2 encodings, but I have realized it is a quite useless confrontation. For example, the remaining issues about the LCL UTF-8 support deal with old WinAPI calls in some FPC libs. But hey, fixing them will equally help the coming UTF-16 support, which also requires the new "W" WinAPI calls! There is no conflict of interest here.

I have also learned that it is quite easy to make code that works with both encodings. We need some more wrapper functions to make it easier, as mentioned in the wiki page:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus#Helper_functions_for_CodePoints
Question: How to implement CodePointCopy, CodePointLength and CodePointPos for Delphi + UTF-16?
This is an innocent question because I don't know how to do it. Nothing to do with the pro/contra encoding thing.
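Not an authoritative answer, but the core of all three helpers is a single forward scan: in UTF-16 a code unit in the range $D800..$DBFF (a high surrogate) starts a two-unit codepoint; everything else is one unit. Here is a sketch of that logic in Python over a list of UTF-16 code units (a Delphi version would walk a UnicodeString the same way, with 1-based indexing; the function names just mirror the wiki's proposed helpers):

```python
def utf16_units(s):
    """Split a string into its UTF-16 code units (as ints)."""
    b = s.encode('utf-16-le')
    return [int.from_bytes(b[i:i + 2], 'little') for i in range(0, len(b), 2)]

def codepoint_length(units):
    """CodePointLength: count codepoints in a UTF-16 code unit sequence."""
    n = i = 0
    while i < len(units):
        # a high surrogate means this codepoint spans two code units
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        n += 1
    return n

def codepoint_copy(units, start, count):
    """CodePointCopy: copy `count` codepoints from codepoint index `start` (0-based)."""
    out, i, cp = [], 0, 0
    while i < len(units) and cp < start + count:
        step = 2 if 0xD800 <= units[i] <= 0xDBFF else 1
        if cp >= start:
            out.extend(units[i:i + step])
        i += step
        cp += 1
    return out

units = utf16_units("a\U0001F377b")       # 'a', wine glass, 'b'
assert codepoint_length(units) == 3
assert codepoint_copy(units, 1, 1) == [0xD83C, 0xDF77]   # the surrogate pair
```

CodePointPos would use the same walk: advance codepoint-wise through the haystack and compare a slice of code units at each codepoint boundary.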
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12641
  • FPC developer.
Re: Need help understanding the effects of Unicode
« Reply #22 on: June 08, 2015, 01:01:54 pm »
Cocoa is also 16-bit. The majority of GUI systems seems to be 16-bit.
I'm not sure, but iirc gettext and freetype use a lot of wchar_t internally too, iow they only present an utf-8 interface, but are wchar_t internally

Marco, you know the details better here.
Now I use this opportunity to mention the future encoding of FPC + LCL.

Sure. As long as it is Delphi compatible and UTF-16 on Windows, I'm fine with everything.
 

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12095
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #23 on: June 08, 2015, 01:18:15 pm »
Just to make sure we are talking about the same thing here: when I say multipoint I mean multiple bytes in utf8 (since a code point is 1 byte long) and multiple words for utf16.
Ok, you mean code-unit (see table at http://en.wikipedia.org/wiki/Code_unit#Code_unit )

code unit: utf8 = 8 bit / utf16 = 16 bit
code point: utf8 = 1..4 code units / utf16 = 1 code unit  [[[EDIT: As corrected later utf16 1 or 2 code units ]]]
char: utf8 and utf16: 1 or more codepoints (single, combining or surrogate)
  No idea if a combining mark can be added to a surrogate pair (I see no reason why not).
  You can have many combining marks added to one codepoint (and sometimes the order matters, sometimes not)
glyph: can be 1 or more chars (and maybe one char can consist of multiple glyphs? not sure)
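That layering is easy to check concretely. A small illustration in Python, where len() counts codepoints (nothing here is Python-specific; the counts follow from the encodings themselves):

```python
# Two codepoints that render as one character (base letter + combining accent):
e_acute = "e\u0301"                                # displays as an accented e
assert len(e_acute) == 2                           # 2 codepoints
assert len(e_acute.encode('utf-8')) == 3           # 3 UTF-8 code units (1 + 2)
assert len(e_acute.encode('utf-16-le')) // 2 == 2  # 2 UTF-16 code units

# One codepoint outside the BMP (the wine glass discussed in the thread):
wine = "\U0001F377"
assert len(wine) == 1                              # 1 codepoint
assert len(wine.encode('utf-8')) == 4              # 4 UTF-8 code units
assert len(wine.encode('utf-16-le')) // 2 == 2     # 2 UTF-16 code units (surrogate pair)
```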


Quote
I have no personal experience with de/pre-composed chars, so I'll simply have to rely on developers who face the problem to inform me about the range those chars are in.
In my current understanding, though, the only thing that changes is the translation, i.e. you see them one next to the other and translate them as the char represented by the pre-composed one, AKA a locale-specific interpretation problem. In some other locale it might be seen as some other char, and if there is no way for them to be seen as separate characters then the standard failed at the simplest of things: keeping itself inside its boundaries.
de-composed and pre-composed are identical chars; they only have a different representation.

As to the need of some code to recognize them, that depends on what the code does.
If you have the 2 codepoints "e" and <accent grave>, then:
- code counting chars should count the 2 codepoints as one char
- code inserting newlines (hard wrap every 80 chars) must not put a newline between the 2 codepoints
- code searching a sub-string should normalize both strings first

- code performing a binary match doesn't need to care (but then this can use bytes anyway)
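The normalize-before-search point can be seen in a few lines; Python's standard unicodedata module exposes the NFC (pre-composed) and NFD (de-composed) forms, and any Unicode library has equivalents:

```python
import unicodedata

decomposed = "e\u0301"       # 'e' + combining acute: 2 codepoints
precomposed = "\u00e9"       # the same accented e as a single codepoint
assert decomposed != precomposed                               # a raw byte search would miss it
assert unicodedata.normalize('NFC', decomposed) == precomposed  # compose
assert unicodedata.normalize('NFD', precomposed) == decomposed  # decompose
```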

Quote
Yes, I was referring to the Windows API; it is after all the most used API in the world (for how long, I don't know). But there is the case of the underlying widget set as well, e.g. QT is utf16 even on Linux.
I do try to see what UTF8 has to offer that utf16 does not (the other way around is a bit more obvious), and I can't. Even ASCII compatibility is not a requirement for me.

Well common api calls, I can think of are:
- painting. The actual painting probably takes way longer than the conversion. But that's just my guess...
- file system: conversion vs processing time depends on the media used?

Anyway, I didn't say utf8 was better. I only tried to point out that the speed difference is (by far) not as big as is sometimes implied. (And see my earlier post: utf8 can be faster in special cases.)

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Processing_issues
Quote
but even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character
For what's better: Search the internet. There are thousands of articles.

My conclusion: Neither is better in itself. It depends on what you want/need to do.

As for utf8 2 special cases that come to mind:
1) *English* text in utf8, saved to a file, can be opened by non-utf editors. Of course that is English only.
2) Lazarus as a Pascal IDE. Since Pascal source code (unless it has lots of comments or inlined strings in other languages) uses mainly Latin chars, utf8 saves memory used by the IDE.
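Case 1 is easy to verify: pure-ASCII text produces byte-for-byte identical output in ASCII and UTF-8, which is exactly why a non-Unicode editor can open it. A quick illustration (the Pascal snippet in the string is arbitrary):

```python
text = "program Hello; begin WriteLn('Hi') end."
# A plain-ASCII file is already valid UTF-8: the bytes are identical.
assert text.encode('utf-8') == text.encode('ascii')
# The same text in UTF-16 takes exactly twice the space:
assert len(text.encode('utf-16-le')) == 2 * len(text.encode('utf-8'))
```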



« Last Edit: June 13, 2015, 05:16:00 pm by Martin_fr »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #24 on: June 08, 2015, 01:19:16 pm »
OR unix made the unfortunate choice. The more I read, the more I see that the case for utf8 is an emotional one, not a logical one. In any case I'm out of this thread; I'll start a new one if I have something new to say.

Ok, I should not have written that "unfortunate choice" thing. It was indeed not based on logic. Sorry about that.
This is a sensitive topic. Unicode discussions continued on the FPC mailing lists for many, many years. Most of it was useless repetition of the same things again and again ...

Anyway some corrections about UTF-8, based on facts:

Quote
UTF-8 was made with storing in mind, not processing. UTF-8 processing requires complex state machines and is O(N^2)
...
 Sure if you process only English text then I guess yes.

Fiji, you don't get O(N^2) even if you iterate over individual UTF-8 characters. You are doing something wrong. Please send your code snippet and we can have a look.
Also, as mentioned in the wiki page:
A byte at a certain position in a multi-byte sequence can never be confused with the other bytes. This allows using the old fast string functions like Pos() and Copy() in many situations where UTF-16 would need more complex and slower code.

What's more, most parsers are interested in tag characters, which are in the ASCII area. They continue to work without UTF-8 specific functions.
It is amazing how much code can be written without UTF-8 specific functions while still fully supporting UTF-8.
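The self-synchronization property behind this is visible at the byte level: ASCII bytes (high bit clear) never occur inside a multi-byte UTF-8 sequence, so a byte-oriented Pos() for an ASCII delimiter can never land in the middle of a character. A small demonstration (Python used here purely to poke at the raw bytes):

```python
s = "käse=Gruyère"                # mixed ASCII and multi-byte characters
b = s.encode('utf-8')

# Every byte of a multi-byte sequence has the high bit set ...
assert all(byte >= 0x80 for byte in "ä".encode('utf-8'))

# ... so searching the raw bytes for an ASCII delimiter is safe:
i = b.find(b'=')                  # byte-level Pos()
assert b[:i].decode('utf-8') == "käse"   # the split lands on a valid boundary
```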

Quote
Other than that symbol and a couple of super rare ones it is fixed width. Compared to UTF8, which is not as soon as you add a bit of Cyrillic or Chinese..

Unicode has some 100000+ characters defined. 16 bits can encode ~65000 of them. That means more than 35000 characters don't fit in one word. Yes, they may be seldom used, but proper code must still handle them. It is not acceptable to have known bugs in the code even if they happen only "sometimes".
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 12095
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #25 on: June 08, 2015, 01:24:07 pm »
About the wine glass. This is a surrogate pair. So technically it is 2 codepoints (even though afaik neither of them can stand alone in valid utf16)

Martin, you have more knowledge about Unicode than I have in general, but here you are wrong. A surrogate pair is only one codepoint. It is a concept used only with UTF-16 and is the equivalent of multi-byte codepoints in UTF-8.
As you remember, I was completely confused by this term and used it in the wrong way. After that I had to study what it means.

Indeed, I was misled by wikipedia (i.e. I misread it).

http://en.wikipedia.org/wiki/UTF-16#U.2BD800_to_U.2BDFFF
Quote
U+D800 to U+DFFF

The Unicode standard permanently reserves these code point values for UTF-16 encoding of the high and low surrogates,

So D8XX is a codepoint, but it is not defined. And in utf16 a sequence of 2 code units using those values describes ONE other codepoint.
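The wine-glass example, spelled out: U+1F377 is one codepoint that UTF-16 encodes as the two code units $D83C/$DF77, both inside that reserved range, and neither half is decodable on its own. Checking with Python:

```python
wine = "\U0001F377"                        # one codepoint
units = wine.encode('utf-16-be')           # 4 bytes = 2 code units
high = int.from_bytes(units[:2], 'big')
low = int.from_bytes(units[2:], 'big')
assert (high, low) == (0xD83C, 0xDF77)     # both in the reserved D800..DFFF block

# A lone surrogate is not valid UTF-16 and refuses to decode:
try:
    units[:2].decode('utf-16-be')
    lone_decodes = True
except UnicodeDecodeError:
    lone_decodes = False
assert not lone_decodes
```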
« Last Edit: June 08, 2015, 01:31:01 pm by Martin_fr »

sfeinst

  • Sr. Member
  • ****
  • Posts: 255
Re: Need help understanding the effects of Unicode
« Reply #26 on: June 08, 2015, 04:17:17 pm »
For those wondering: I converted all my string function calls to their UTF8 equivalents and also replaced calls similar to:
MyStr[Index]
to
UTF8Copy(MyStr, Index, 1)

and all my code looks like it is working now (shift text in, shift text out, auto indent, search and replace).

I need to do more testing, but it's looking good. Thanks for all the help, everyone.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4673
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #27 on: June 08, 2015, 05:35:34 pm »
For those wondering: I converted all my string function calls to their UTF8 equivalents and also replaced calls similar to:
MyStr[Index]
to
UTF8Copy(MyStr, Index, 1)

and all my code looks like it is working now (shift text in, shift text out, auto indent, search and replace).

Good ... except that you should not blindly replace all string functions with their UTF8 equivalents. You can end up with O(N^2) performance, as Fiji warned.
Your original problem indeed requires the UTF8...() functions, because TMemo uses UTF-8 char positions.
Yet in many cases you can use byte positions with UTF-8 data and it works correctly. You must look at it case by case. See the examples in the wiki page.
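The O(N^2) trap is easy to state: any function that addresses a UTF-8 string by codepoint index (as UTF8Copy does) has to scan from the start of the string, so calling it once per position turns a single pass into N passes. A sketch over raw UTF-8 bytes (Python for illustration; utf8_codepoint_at is a hypothetical stand-in for such a helper, not a real library function):

```python
def utf8_codepoint_at(b, index):
    """Return codepoint number `index` (0-based) of UTF-8 bytes `b`.
    Like UTF8Copy(S, Index, 1) it must scan from the start: O(N) per call,
    so calling it for every index in a loop is O(N^2) overall."""
    i = cp = 0
    while i < len(b):
        lead = b[i]
        # lead byte tells the sequence length: 0xxxxxxx=1, 110=2, 1110=3, 11110=4
        step = 1 if lead < 0xC0 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
        if cp == index:
            return b[i:i + step].decode('utf-8')
        i += step
        cp += 1
    raise IndexError(index)

b = "Grüße".encode('utf-8')
assert utf8_codepoint_at(b, 2) == "ü"
# The O(N) way: walk the string once instead of re-indexing from the start.
assert [utf8_codepoint_at(b, i) for i in range(5)] == list("Grüße")
```

The wiki's advice holds in any language: iterate forward once, advancing by the codepoint length at each step, instead of re-indexing from position 1 on every access.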

As I have revealed, I like UTF-8. I started to like it after learning the details and experimenting with code. IMO it is a brilliant encoding, and it shines also in the ways it can be processed by code. There is so much wrong information about the strengths and weaknesses of the different encodings.

Yet supporting UTF-16 for Delphi compatibility and for other reasons is a valid goal. I believe we can have a future LCL which is source compatible with both encodings. That means a binary version for each encoding can be compiled from the same sources. It will take many years to happen, though.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

sfeinst

  • Sr. Member
  • ****
  • Posts: 255
Re: Need help understanding the effects of Unicode
« Reply #28 on: June 08, 2015, 05:55:35 pm »
Good ... except that you should not blindly replace all string functions with their UTF8 equivalents. You can end up with O(N^2) performance, as Fiji warned.
Your original problem indeed requires the UTF8...() functions, because TMemo uses UTF-8 char positions.
Yet in many cases you can use byte positions with UTF-8 data and it works correctly. You must look at it case by case. See the examples in the wiki page.
Absolutely. My replacements were only for dealing with the TMemo field, and especially when selections were involved (which is why search/replace got involved).


Patito

  • New member
  • *
  • Posts: 7
Re: Need help understanding the effects of Unicode
« Reply #29 on: June 09, 2015, 09:55:04 am »
This discussion contains quite a lot of misinformation... It hurts enough for me to comment...

1) Basics
code unit: utf8 = 8 bit / utf16 = 16 bit   (usually char = code unit)
code point: utf8 = 1..4 code units / utf16 = 1..2 code units
glyph: can be 1 or more code points (not so easy in any encoding)

2) Speed of Algorithms
That UTF-8 is inherently slower is a myth. People arguing that are usually confusing UTF-16 with UCS-2.
UTF-16 is in most use cases about a factor of 2 slower (Latin letters, digits and whitespace need twice as much memory, so e.g. copying is 2x slower).
Also, dictionaries in UTF-16 either use 256x more memory or are 2x slower for most use cases.
(Take a look at a Chinese dictionary and you'll notice that having more letters in the alphabet buys you trouble.)
It's not 100% correct, but usually it's safe to say that UTF-16 uses more memory and is slower.

In some edge cases (with a lot of Russian characters) there's a slight advantage for UTF-16, but it's only a few percent.
In general, if you want to process Russian text, both UTF-8 and UTF-16 are equally bad. You could use a completely different custom encoding optimized for Russian and gain 2x in speed.
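Those size claims are easy to spot-check; the exact factor depends entirely on the text:

```python
latin = "The quick brown fox jumps over the lazy dog"
cyrillic = "Съешь же ещё этих мягких французских булок"

# Latin-only text: UTF-16 is exactly 2x the size of UTF-8.
assert len(latin.encode('utf-16-le')) == 2 * len(latin.encode('utf-8'))

# Cyrillic letters cost 2 bytes in both encodings; UTF-8 still comes out
# slightly ahead here because the spaces stay 1 byte.
assert len(cyrillic.encode('utf-8')) < len(cyrillic.encode('utf-16-le'))
```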

3) Delphi Compatibility
What kind of Delphi do people use? Most Delphi people I have met use D5, D7, .... Those "upgrading" to something unstable did that for a while and then migrated to something else. If Delphi compatibility is an issue, anything UTF-16 is nonsense.

Also "the world" (web, databases) uses UTF-8. For serialization, UTF-16 uses 2x the bandwidth; nobody with a brain uses UTF-16 to serialize data. So using UTF-16 internally adds a lot of forced conversions when talking to the world.

Those few calls to the Win32 API don't really matter. A forced, useless conversion when bulk-reading external data is something that does matter...

 
