Author Topic: Need help understanding the effects of Unicode  (Read 16240 times)

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Need help understanding the effects of Unicode
« Reply #30 on: June 09, 2015, 11:09:18 am »
This discussion contains quite a lot of misinformation... It hurts enough for me to comment...

1) Basics
code unit: utf8 = 8 bit / utf16 = 16 bit   (usually char = code unit)
code point: utf8 = 1..4 code units / utf16 = 1..2 code units

(Older UTF-8 standards allowed sequences of up to 6 units, though I doubt many such high UTF-8 codepoints can be found in the wild)
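The code unit / code point distinction above can be sketched quickly (Python here purely for illustration; the thread itself is about FPC/Delphi strings):

```python
# Code units per character: bytes are UTF-8's 8-bit units;
# UTF-16-LE byte length divided by 2 gives its 16-bit units.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:  # A, é, €, emoji
    utf8_units = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    print(hex(ord(ch)), utf8_units, utf16_units)
# → 0x41 1 1 / 0xe9 2 1 / 0x20ac 3 1 / 0x1f600 4 2
```

Everything in the Basic Multilingual Plane is one UTF-16 unit; only characters above U+FFFF need a surrogate pair (2 units), while UTF-8 already needs 2-4 units from U+0080 upward.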

Quote
glyph: can be 1 or more code points (not so easy in any encoding)

The rationale was that this mainly mattered when rendering text, and the codepoint-to-glyph translation depends on the properties of the screen (e.g. changing o-umlaut to oe when a terminal doesn't support umlauts, ligature availability in the font, etc.)

Quote
2) Speed of Algorithms
That UTF-8 is inherently slower is a myth. People arguing for that are usually confusing UTF16 with UCS-2.
UTF16 is in most use cases about a factor of 2 slower. (Latin alphabet, numbers and whitespace need twice as much memory, so e.g. copying is 2x slower).

Wait a minute. What are you benchmarking? Pumping text around or processing it? FPC string types are copy-on-write.

Quote
Also, dictionaries in UTF16 either use 256x more memory or are 2x slower for most use cases.

Please explain. How is a codepoint dictionary in utf16 larger than in utf8 for a complete set of codepoints?

Quote
In some Edge-Cases (with a lot of Russian Characters) there's a slight advantage for UTF-16, but it's only a few percent.

Also Middle Eastern languages, and in some cases also Latin languages with accents. E.g. if the existence of larger codepoints requires reallocation of memory in utf8, while it does not in utf16.
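The accented-Latin case is easy to check; a small illustrative sketch (Python, not FPC):

```python
s = "na\u00efve caf\u00e9"  # "naïve café": 10 code points, 2 of them accented
assert len(s) == 10
assert len(s.encode("utf-8")) == 12      # accented chars cost 2 units each
assert len(s.encode("utf-16-le")) == 20  # every BMP char is one 16-bit unit
```

In UTF-16 the string length in code units equals the code point count here, so a buffer sized from the character count never needs to grow; in UTF-8 each accent adds a byte beyond that estimate.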

Quote
3) Delphi Compatibility
What kind of Delphi do people use?

I encounter both. People writing new code are mostly using new versions (typically XE2-3 and up). I also encounter a lot of D7, but those users are mostly making fairly simple utilities (e.g. PLC HMIs). They are not looking for change. And with every year this pattern gets more pronounced.

Yes, old Delphis are used, but their usage is fairly inert. There are exceptions (e.g. people with large non visual codebases), but those are not a _Lazarus_ target in the first place.

Most people that maintain complex GUI apps migrated, if only for Unicode, newer Office, and Vista+ support.

Quote
Most Delphi-people I met use D5, D7, .... Those "Upgrading" to something unstable did that for a while and then migrated to something else. If Delphi-Compatibility is an issue anything UTF-16 is nonsense.

That is simply nonsense. As said, the Delphi world migrated, and when component vendors start killing old non-Unicode support they'll probably also kill off whatever Lazarus support they have (many Unicode-only components already have). Simply because supporting Lazarus then suddenly becomes a major extra effort. Worse even if Lazarus has some invention of its own.

The same goes for the open source projects, though probably that will take longer.

Quote
Also "the world" (Web, Databases) use UTF-8.

As a document format. We are here talking about an application's internal format. And then the world uses Java and C#, and they are UTF16 internally.


Quote
For serialization UTF-16 uses 2x bandwidth - nobody with a brain uses UTF-16 to serialize data. So using UTF-16 internally adds a lot of forced conversions when talking to the world.

Most textual serialization in Delphi Unicode is utf16. Again you are confusing application string formats and document formats.

Quote
Those few calls to the Win32-API don't really matter. A forced useless conversion when bulk-reading external data is something that does matter...

Killing the component market, and losing connection with the Delphi Open source sector does matter.

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Need help understanding the effects of Unicode
« Reply #31 on: June 09, 2015, 11:19:48 am »
This discussion contains quite a lot of misinformation... It hurts enough for me to comment...

1) Basics
code unit: utf8 = 8 bit / utf16 = 16 bit   (usually char = code unit)
code point: utf8 = 1..4 code units / utf16 = 1..2 code units
glyph: can be 1 or more code points (not so easy in any encoding)
so where is the misinformation in our comments about the code points? or is it here to establish a basis of communication?

2) Speed of Algorithms
That UTF-8 is inherently slower is a myth.
People arguing for that are usually confusing UTF16 with UCS-2.

No it's not; it hits the multibyte case a lot sooner than utf16, and in some cases it might never hit it, e.g. an SQL parser would almost never go beyond the first plane in utf16.

UTF16 is in most use cases about a factor of 2 slower. (Latin alphabet, numbers and whitespace need twice as much memory, so e.g. copying is 2x slower).
Also, dictionaries in UTF16 either use 256x more memory or are 2x slower for most use cases.
256 times more memory than utf8? Really? Or is something else implied with the 256x? Any proof to back up such ridiculous claims?
And 2 times slower? Another use case I haven't seen? Wow, you are full of use cases.

(Take a look at a Chinese Dictionary, and you'll notice that having more letters in the Alphabet buys you trouble)
It's not 100% correct, but usually it's safe to say that UTF-16 uses more memory and is slower.

What are you talking about? What does utf16 have in common with the Chinese alphabet? Well, they are called ideograms if I'm not mistaken, and it's not an alphabet, but yeah, let's call it that. In one case you try to create a table of characters for all known languages; in the other it's a single language that went overboard, or was not designed properly in the first place. Drawing any kind of relationship between them is laughable at best.

In some Edge-Cases (with a lot of Russian Characters) there's a slight advantage for UTF-16, but it's only a few percent.
In General if you want to process Russian Text both UTF-8 and UTF-16 are equally bad. You could use a completely different custom encoding optimized for Russian and gain x2 in speed.

As far as I can see all Russian characters are on the first plane in utf16, so processing it should not be any different from English in the same encoding. As for utf8, you are already in trouble the moment you leave the first 128 characters, processing-wise that is.
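The size claim for Russian text is easy to verify; a quick illustrative check (Python, not FPC):

```python
s = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет", 6 Cyrillic letters
assert len(s.encode("utf-8")) == 12       # 2 bytes per letter in UTF-8
assert len(s.encode("utf-16-le")) == 12   # 2 bytes per letter in UTF-16 too
# All Cyrillic is in the BMP: UTF-16 stays one unit per letter, UTF-8
# needs 2-byte sequences, and the total sizes come out identical here.
```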

3) Delphi Compatibility
What kind of Delphi do people use? Most Delphi-people I met use D5, D7, .... Those "Upgrading" to something unstable did that for a while and then migrated to something else. If Delphi-Compatibility is an issue anything UTF-16 is nonsense.

So you haven't met anyone needing it yet? Good for you.

Also "the world" (Web, Databases) uses UTF-8. For serialization UTF-16 uses 2x bandwidth - nobody with a brain uses UTF-16 to serialize data. So using UTF-16 internally adds a lot of forced conversions when talking to the world.

The web uses it because it might save some disk space; databases for the same reason, although most databases supported utf8 first and then went for utf16 and other encodings, and this has played a huge role in the use of utf8 in databases. If the first had been utf16, things might have been different.

I see your point about serialization and raise you one: anyone with half a brain uses binary data for communication; serialization is such a waste of bandwidth.

My experience with serialization is: serialize, compress, in some cases encrypt, and send. I bet that utf16 and utf8 would compress to almost identical sizes, though I never tested it. Granted, my WAN/LAN communication experience is from 2 companies only, and in the financial sector on top of that; a 3rd party we worked with was Java based and used serialization and utf16 (XML files) for everything with no problems (compression was part of their transport).
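The "never tested it" bet can be checked with a quick, admittedly crude zlib experiment (Python for illustration; real results depend on the data and compressor):

```python
import zlib

# Mixed Cyrillic/ASCII text, repeated to give the compressor something to work with
text = "\u041f\u0440\u0438\u0432\u0435\u0442, Delphi! " * 200
u8 = text.encode("utf-8")
u16 = text.encode("utf-16-le")
c8 = len(zlib.compress(u8))
c16 = len(zlib.compress(u16))
# Raw sizes differ noticeably, compressed sizes end up in the same ballpark
print(len(u8), len(u16), c8, c16)
```

For repetitive payloads like this, the redundancy the two encodings add is itself highly compressible, so the on-the-wire difference largely disappears; for short, incompressible messages the raw 2x gap survives.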

Those few calls to the Win32-API don't really matter. A forced useless conversion when bulk-reading external data is something that does matter...

What makes you think that it is faster to convert a few thousand times instead of 2, or 10, or in the case of badly designed software 500 serializations? Now that I think about it, what makes you think that serialization needs any kind of conversion at all? It should go from utf8 data in the database to utf8 serialization data directly, even if my application uses utf16.

In short, do share those use cases with us. I'm most interested in running my own tests; so far the only one I got out of this conversation is copying. What else do you have for me? I'm most interested in this.


PS. I do not consider copying of data as processing; it requires no code at all, a simple move of X bytes from A to B, and with the existing mechanism for string length in FPC and Delphi there is no real cost to calling Length. Oh wait, that's my next use case: length.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

Patito

  • New member
  • *
  • Posts: 7
Re: Need help understanding the effects of Unicode
« Reply #32 on: June 09, 2015, 01:10:06 pm »
Please people, at least you should try to educate yourself a bit about the subject before trying to discuss the subject in public.  (Marco: the UTF-8 standard doesn't allow 5 or 6 bytes)

It's useful to study algorithms, benchmark them, and google helps to understand the basic facts.
Do that first. Do half as much as I did, and maybe then we can talk.

I'm not your nanny to answer beginner questions. And I will not argue with confused people about
whether encrypting 2MB is slower than encrypting 1MB, or if time to allocate memory can't be called processing, and therefore doesn't matter...

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Need help understanding the effects of Unicode
« Reply #33 on: June 09, 2015, 01:14:37 pm »
Please people, at least you should try to educate yourself a bit about the subject before trying to discuss the subject in public.  (Marco: the UTF-8 standard doesn't allow 5 or 6 bytes)

It's useful to study algorithms, benchmark them, and google helps to understand the basic facts.
Do that first. Do half as much as I did, and maybe then we can talk.

I'm not your nanny to answer beginner questions. And I will not argue with confused people about
whether encrypting 2MB is slower than encrypting 1MB, or if time to allocate memory can't be called processing, and therefore doesn't matter...
and plonked. Emotionally unbalanced people should not post on public forums. What a waste of our time.
Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12645
  • FPC developer.
Re: Need help understanding the effects of Unicode
« Reply #34 on: June 09, 2015, 01:29:31 pm »
Please people, at least you should try to educate yourself a bit about the subject before trying to discuss the subject in public.  (Marco: the UTF-8 standard doesn't allow 5 or 6 bytes)

It does not now. It did till 2003, which is what I said:

Quote from: marcov
(Older UTF-8 standards allowed sequences of up to 6 units, though I doubt many such high UTF-8 codepoints can be found in the wild)

Quote
It's useful to study algorithms, benchmark them, and google helps to understand the basic facts.
Do that first. Do half as much as I did, and maybe then we can talk.

The problem is that you don't seem to grok the difference between utf8 file format and utf8 in a production application that isn't only parametrizing some template.

Quote
or if time to allocate memory can't be called processing, and therefore doesn't matter...

It does. If you need to preallocate a larger amount of memory for utf8 and then have to shrink it, while with utf16 it is OK to simply preallocate and only reallocate in the far more remote case that a surrogate is encountered, then the utf8 operation has extra memory manager operations compared to the utf16 one.

« Last Edit: June 09, 2015, 02:25:52 pm by marcov »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4676
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #35 on: June 09, 2015, 09:05:00 pm »
so where is the misinformation in our comments about the code points?

Well, Fiji and yourself mixed the terms a few times:
  "... you encounter the multi code point situation a lot sooner in utf8."
while clearly you meant "multi code unit situation". I am wrapping my mind around these terms constantly, too. "Multi code point" means accented, decomposed Unicode characters. Mixing the terms can make the discussion very confusing.
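The "multi code point" case being referred to is decomposed characters; a small illustrative sketch (Python's unicodedata, not FPC):

```python
import unicodedata

precomposed = "\u00e9"  # é as one code point (U+00E9)
decomposed = unicodedata.normalize("NFD", precomposed)  # 'e' + combining acute
assert len(precomposed) == 1  # one code point...
assert len(decomposed) == 2   # ...two code points, yet the same glyph on screen
assert unicodedata.normalize("NFC", decomposed) == precomposed
```

This multi-code-point situation exists in every encoding; the multi-code-unit situation (2-4 bytes in UTF-8, surrogate pairs in UTF-16) is the one that differs between them.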

Now the discussion goes into useless pro/contra encoding bashing. However, both encodings are here to stay. At least I am committed to working on the UTF-16 version of the LCL in the future, once the FPC libs are ready.
Yes! Now I have learned the details and it feels very realistic. For example, the "string" type is already needed for individual UTF-8 characters when iterating them. The same concept works perfectly well for UTF-16 and, as an extra bonus, produces more robust code than the average UTF-16 code out there. With proper wrapper functions the exact same code can support both encodings! Besides, the LCL itself does not need to iterate individual characters often; it is encapsulated in a few functions.
No worries, be happy ...

Now the difference compared to the prolonged Unicode discussion in the FPC lists is that the decisions are already made. Nobody needs to be converted to add support for a certain encoding.
There is an improved UTF-8 solution already in the LCL, and UTF-16 is being worked on. Like a miracle, it seems possible to support both.

Delphi compatibility is important and thus UTF-16 must be supported, no doubt. Delphi is again gaining popularity and every serious Delphi developer must care about Unicode. Patito wrote nonsense about this issue.
Anyway let's keep the technical facts and terms straight, "code unit" and "code point" and all.
One misconception must be corrected because it keeps popping up: code points in UTF-16 are not fixed width and must not be treated as such. Yes, typical Delphi code does so and thus it is broken. It ignores >35000 code points, which is a bug.
It also means UTF-16 has no speed advantage here. Proper code must check for surrogate pairs, which makes it more complex and slower even when it does not find any.
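The surrogate check that proper code must do looks roughly like this (an illustrative Python sketch of the UTF-16 pairing rule, not LCL code):

```python
def utf16_decode(units):
    """Turn a sequence of 16-bit code units into code points,
    pairing up surrogates - the check every UTF-16 iterator needs."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high surrogate: consume the pair
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:                      # ordinary BMP code unit
            out.append(u)
            i += 1
    return out

# "A" plus an emoji: three 16-bit units, but only two code points
assert utf16_decode([0x0041, 0xD83D, 0xDE00]) == [0x41, 0x1F600]
```

Code that indexes UTF-16 strings by code unit without this check silently splits such pairs, which is exactly the "typical Delphi code" bug described above.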
Looking at my own code I think it is faster with UTF-8 but it is only a "gut feeling", I did not make exact measurements.
In general we can say that both encodings are good enough. In technical perspective this encoding war is quite useless.
The API compatibility issue has been exaggerated, too. Conversion between encodings is quite fast and plays only a marginal role in API calls (says my gut feeling). This applies in both directions, for both Windows and Unix APIs.

What's more, most parser code continues to work with the old ASCII concept regardless of encoding. HTML, XML, BB (bulletin board), SQL, etc. use tags and keywords in the ASCII range. A parser typically does not process the data between tags.
Even code that deals with human languages may not need to iterate characters very often. The Unicode specific stuff is often encapsulated in functions.
The problems have been exaggerated.
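The reason ASCII-level parsing stays safe in UTF-8 is that ASCII bytes never occur inside a multi-byte sequence; a tiny illustrative sketch (Python, hypothetical tag names):

```python
# ASCII bytes (0x00-0x7F) never appear inside a UTF-8 multi-byte
# sequence, so a byte-level scanner can hunt for ASCII tags safely:
data = "<b>\u041f\u0440\u0438\u0432\u0435\u0442</b>".encode("utf-8")
start = data.index(b"<b>") + len(b"<b>")
end = data.index(b"</b>")
inner = data[start:end].decode("utf-8")
assert inner == "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет"
```

This self-synchronizing property is what lets byte-oriented HTML/XML/SQL parsers run unchanged over UTF-8 input.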
« Last Edit: June 09, 2015, 09:25:41 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.
