
Author Topic: Need help understanding the effects of Unicode  (Read 15513 times)

sfeinst

  • Full Member
  • ***
  • Posts: 235
Need help understanding the effects of Unicode
« on: June 06, 2015, 10:10:24 pm »
I am a novice when it comes to Unicode - very ANSI-centric in my experience.  I am using Lazarus 1.4 on Windows 8.1.  I have an application that contains a TMemo.  I've written code that allows a user to select multiple lines, press TAB, and have the lines shift in (like a text editor would).  This has worked fine for me for a while.  Recently, I copied text from a web page and pasted it into the TMemo.  This particular text does not work well with my shift code.  I looked at the text with a hex editor and noticed a few locations (6, to be specific) where, instead of a space, there were 2 bytes (which I looked up and saw encode a Unicode no-break space).

I stepped through my code and noticed that when I read the TMemo's SelStart property, it returned a value 6 too small if I was counting by bytes.  So it appears that SelStart treats each of these 2-byte sequences as one character - which is probably correct handling for Unicode.  My code makes use of the Length function.  This function returns the full byte length of TMemo.Text (not shortening the value because of the Unicode), so it is basically 6 too long when used in conjunction with SelStart.  This combination is breaking my code.
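To illustrate what I mean (a contrived sketch; #$C2#$A0 is the UTF-8 byte pair I see in the hex editor for the no-break space, and ShowByteCount is just a throwaway method on my form):

Code: [Select]
procedure TForm1.ShowByteCount;
var
  S: String;
begin
  // 'a', a no-break space (U+00A0, 2 bytes in UTF-8), then 'b'
  S := 'a' + #$C2#$A0 + 'b';
  // Length counts bytes here and shows 4,
  // though the memo displays (and SelStart counts) 3 characters
  ShowMessage(IntToStr(Length(S)));
end;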

I'm not looking for someone to debug my function.  I'll work on it.  What I would like is advice on the correct way to deal with Unicode.  So, a few questions (and let me know if I am going down the wrong path):
1) Should Length return the count of bytes rather than characters?  Is that even what it is doing?
2) Since SelStart seems to be based on characters, not bytes, and Length on bytes, not characters (assuming I am understanding this at all), what are some ways to handle this?
3) Should I be going through some conversion functions?  I'm guessing I could convert the text to ANSI prior to pasting from the clipboard, but it seems to me that would make my app non-global.
4) Should I replace all instances of String in my app with WideString?  Would that even help?

Appreciate any comments on the topic.

Thanks

Bart

  • Hero Member
  • *****
  • Posts: 5531
    • Bart en Mariska's Webstek
Re: Need help understanding the effects of Unicode
« Reply #1 on: June 06, 2015, 11:04:46 pm »
This is an off-topic (sort of) answer, but you do know that TSynEdit has a function that indents a selected block of text?
(No need to re-invent it.)

Bart

sfeinst

  • Full Member
  • ***
  • Posts: 235
Re: Need help understanding the effects of Unicode
« Reply #2 on: June 06, 2015, 11:26:08 pm »
Yes, I do.   :)  As a matter of fact, my original app version used TSynEdit, but I really wanted word wrap, so I switched to TMemo.

ChrisF

  • Hero Member
  • *****
  • Posts: 542
Re: Need help understanding the effects of Unicode
« Reply #3 on: June 07, 2015, 03:24:26 pm »
What follows are a few simplified generalities concerning Unicode and Free Pascal/Lazarus (sorry if you already know them).


First of all, I don't know what exactly you mean by "Unicode", but Unicode should be considered a standard with several specifications (see http://en.wikipedia.org/wiki/Unicode). Basically, it defines several character encodings, all of which belong to the 'Unicode' standard.

So, when you say "Unicode", it may refer to different character encodings.


The 2 main Unicode character encodings relevant to Free Pascal are UTF-8 and UTF-16:

- UTF-8: each "character" is coded using 1 to 4 bytes. For compatibility with ASCII, all "character codes" up to 127 are encoded using 1 byte, exactly as in the ASCII table: for all these "characters", ASCII = UTF-8.

- UTF-16: each "character" is coded using one or possibly two 16-bit units (i.e. 2 or 4 bytes).

Note: "character" is an improper term when dealing with Unicode specifications; "code point" should be used instead. I've used "character" to make the comparison easier, as you are familiar with ASCII/ANSI.


Back to Free Pascal/Lazarus:

. Lazarus and the LCL use UTF-8 "everywhere" by default: strings, source code, form sources, text control properties, etc. In the LCL, by default, the 'String' type is effectively a 'UTF8String'.

. Free Pascal: up to the (current) 2.6.4 version, 'String' means 'AnsiString'. Practically, this means that the RTL functions use ANSI strings by default, and that you may have to convert your strings when calling them from Lazarus code (i.e. UTF-8 -> ANSI, ANSI -> UTF-8). Free Pascal versions after 2.6.4 (i.e. 2.7.x/2.8.x/3.0) offer better Unicode support.

. Windows: the Windows API uses ANSI (the system code page, in fact) or WideString types. The current Free Pascal version uses the ANSI API version, while the LCL is able to use both the ANSI and the WideString versions. The LCL internally does the UTF-8 <-> ANSI or UTF-8 <-> WideString conversion when dealing with the Windows API. As a simplification, the WideString type may be treated as UTF-16, though technically that's not exactly true, due to differences in surrogate processing (as far as I've understood it).
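For instance, a sketch of such a conversion when calling an ANSI RTL function from LCL code (the UTF8ToSys and FileExistsUTF8 helpers are from the LazUTF8 and FileUtil units; the Edit1 control and CheckFile method are just assumptions for the example):

Code: [Select]
uses
  ..., FileUtil, LazUTF8;

procedure TForm1.CheckFile;
var
  FileName: String;  // UTF-8, as delivered by LCL controls
begin
  FileName := Edit1.Text;
  // The plain RTL function expects the system (ANSI) code page, so convert:
  if FileExists(UTF8ToSys(FileName)) then
    ShowMessage('Found (RTL FileExists after UTF8ToSys)');
  // Or use the UTF-8 wrapper from FileUtil, which converts internally:
  if FileExistsUTF8(FileName) then
    ShowMessage('Found (FileExistsUTF8)');
end;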

For more pieces of information:
- http://wiki.freepascal.org/Character_and_string_types
- http://wiki.freepascal.org/UTF8_strings_and_characters
- http://wiki.freepascal.org/FPC_Unicode_support
- http://wiki.freepascal.org/LCL_Unicode_Support


Your case:

Sorry, I can't answer your questions directly (except for the last one, #4: it would be a very bad idea).

As far as I understand, your problem comes from the fact that you use "non UTF-8 aware" functions like "Length" with UTF-8 strings. If you have a look at the 2nd of my links (http://wiki.freepascal.org/UTF8_strings_and_characters), you'll see that you may or may not have to use specific UTF-8 functions (or conversions), depending on what you are doing.

The most common incorrect assumption with UTF-8 strings is that characters are always 1 byte long. For instance, "Pos(searchcharacter, wholestring) + 1" won't necessarily give you the position of the next character after searchcharacter in wholestring.

And don't mix "UTF-8 aware" functions/properties (like the "SelStart" property) with "non UTF-8 aware" functions (like "Pos" or "Length") without any precautions.

I guess that judicious use of the UTF-8 versions of the Free Pascal functions you are using might be a solution for you: UTF8Pos, UTF8Length, and so on. Only where it's really needed, of course; you don't have to modify all your code everywhere.


For instance, with a string containing the non-ASCII characters "é" and "à" (in a project with a form containing a button and a memo control):

Code: [Select]
uses
  ..., LazUTF8;

procedure TForm1.Button1Click(Sender: TObject);
var Stru: String; // = UTF8String
var Stra: AnsiString;
var Strw: WideString;
//
var OutRes: String;
begin
  Stru := '1234567890éabcdeàfghij';
  Stra := UTF8ToAnsi(Stru);
  Strw := UTF8Decode(Stru);
  //
  OutRes := '';
  OutRes := OutRes + 'Length Stru=' + IntToStr(Length(Stru)) + '  UTF8Length Stru=' + IntToStr(UTF8Length(Stru)) + sLineBreak;
  OutRes := OutRes + 'Length Stra=' + IntToStr(Length(Stra)) + sLineBreak;
  OutRes := OutRes + 'Length Strw=' + IntToStr(Length(Strw)) + sLineBreak + sLineBreak;
  //
  OutRes := OutRes + 'Pos é (non UTF8)=' + IntToStr(Pos('é', Stru)) + sLineBreak;
  OutRes := OutRes + 'Pos é (UTF8)=' + IntToStr(UTF8Pos('é', Stru)) + sLineBreak + sLineBreak;
  OutRes := OutRes + 'Pos à (non UTF8)=' + IntToStr(Pos('à', Stru)) + sLineBreak;
  OutRes := OutRes + 'Pos à (UTF8)=' + IntToStr(UTF8Pos('à', Stru)) + sLineBreak + sLineBreak;
  //
  OutRes := OutRes + 'Extract (non UTF8)=' + Copy(Stru, Pos('é', Stru) + 1, 4) + sLineBreak;
  OutRes := OutRes + 'Extract (non UTF8 OK)=' + Copy(Stru, Pos('é', Stru) + Length('é'), 4) + sLineBreak;
  OutRes := OutRes + 'Extract (UTF8)=' + UTF8Copy(Stru, UTF8Pos('é', Stru) + 1, 4) + sLineBreak + sLineBreak;
  //
  Memo1.Text := OutRes;
end;

The result is:
Code: [Select]
Length Stru=24  UTF8Length Stru=22
Length Stra=22
Length Strw=22

Pos é (non UTF8)=11
Pos é (UTF8)=11

Pos à (non UTF8)=18
Pos à (UTF8)=17

Extract (non UTF8)=?abc
Extract (non UTF8 OK)=abcd
Extract (UTF8)=abcd

Note the incorrect result for 'Extract (non UTF8)', and the differences between the Length/UTF8Length and Pos/UTF8Pos results.
« Last Edit: June 07, 2015, 03:26:41 pm by ChrisF »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10900
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #4 on: June 07, 2015, 06:19:48 pm »
Quote
- UTF-8: each "character" is coded using 1 to 4 bytes.
Quote
Note: "character" is an improper term when dealing with Unicode specifications;

Many characters are just one codepoint, but some are 2 or more (search for "combining codepoints" (or "combining characters") and "surrogate pairs").

However, I am pretty sure SelStart deals in codepoints. And so do Utf8Length and Utf8Pos.
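For illustration, a sketch of the combining-codepoint case (assuming the LazUTF8 unit; #$CC#$81 is the UTF-8 encoding of the combining acute accent U+0301, and ShowCombining is a made-up procedure name):

Code: [Select]
uses
  ..., LazUTF8;

procedure ShowCombining;
var
  Precomposed, Decomposed: String;
begin
  Precomposed := 'é';             // U+00E9: one codepoint (2 bytes in UTF-8)
  Decomposed  := 'e' + #$CC#$81;  // U+0065 + U+0301: two codepoints
  Writeln(UTF8Length(Precomposed)); // 1
  Writeln(UTF8Length(Decomposed));  // 2, though it renders as one character
end;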

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Need help understanding the effects of Unicode
« Reply #5 on: June 07, 2015, 07:10:50 pm »
Mind you, processing UTF-8 is slower than UTF-16. A UnicodeString has its length stored in a field of the string record, while for UTF-8 the character count has to be computed each time.

Compare that to Qt, which is fully UTF-16. I also didn't know that the LCL converts strings on each Windows API call. That must make it painfully slow.
« Last Edit: June 07, 2015, 07:40:49 pm by Fiji »

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10900
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #6 on: June 07, 2015, 08:02:36 pm »
Quote
Mind you, processing UTF-8 is slower than UTF-16.

If you are interested in codepoints instead of chars, then that's true. Though even then, how often do you need the length? Many string operations, such as searching or replacing, iterate over all chars (and can use the byte length), or run up to an abort condition; for those, the speed difference is negligible.

If you need actual chars, then UTF-16 is variable length too: combining marks and surrogates mean that in UTF-16 there are chars taking 4, 6, or maybe more bytes.
« Last Edit: June 07, 2015, 08:10:26 pm by Martin_fr »

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Need help understanding the effects of Unicode
« Reply #7 on: June 07, 2015, 08:07:08 pm »
Various string processing functions become slower. It is the usual memory vs. processor argument: do you want to use less memory, or do you want faster processing?
In C++ land they have to speed UTF-8 up with some serious SIMD techniques to get some performance out of it.

http://woboq.com/blog/utf-8-processing-using-simd.html

It's only useful for network protocols etc.
« Last Edit: June 07, 2015, 08:15:48 pm by Fiji »

sfeinst

  • Full Member
  • ***
  • Posts: 235
Re: Need help understanding the effects of Unicode
« Reply #8 on: June 07, 2015, 08:13:43 pm »
@ChrisF - thanks for the detailed information.  That kind of information was what I was looking for.  I'll have to look at my code and make sure I am not basing too much on bytes so I can convert.  But it definitely gives me a starting point.

@Martin_fr and @Fiji, thank you both as well.  The more info the better.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10900
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #9 on: June 07, 2015, 08:22:57 pm »
No, this is not about saving memory. Most non-Latin text needs more bytes in UTF-8 than in UTF-16.

My argument is that it is a question of algorithm.

This will be slow:
Code: [Select]
for a := 1 to UTF8Length(s) do Foo(UTF8Copy(s, a, 1));
The above would have worked in UTF-16 too (though in both UTF-16 and UTF-8 it processes codepoints, not chars!).

For UTF-8, the code below does the same.
Yes, it will still be slower, but only a tiny bit.
Code: [Select]
  a := 1;
  while a <= Length(s) do begin
    i := UTF8CharacterLength(@s[a]);  // byte count of the codepoint starting at s[a]
    Foo(@s[a], i);
    a := a + i;
  end;

But this is a digression from the OT.
« Last Edit: June 07, 2015, 08:26:12 pm by Martin_fr »

taazz

  • Hero Member
  • *****
  • Posts: 5368
Re: Need help understanding the effects of Unicode
« Reply #10 on: June 08, 2015, 12:06:24 am »
Quote
No, this is not about saving memory. Most non-Latin text needs more bytes in UTF-8 than in UTF-16.

My argument is that it is a question of algorithm.

This will be slow:
Code: [Select]
for a := 1 to UTF8Length(s) do Foo(UTF8Copy(s, a, 1));
The above would have worked in UTF-16 too (though in both UTF-16 and UTF-8 it processes codepoints, not chars!).

For UTF-8, the code below does the same.
Yes, it will still be slower, but only a tiny bit.
Code: [Select]
  a := 1;
  while a <= Length(s) do begin
    i := UTF8CharacterLength(@s[a]);  // byte count of the codepoint starting at s[a]
    Foo(@s[a], i);
    a := a + i;
  end;

But this is a digression from the OT.
The difference is that in UTF-16, code points and characters are the same thing in 99% of languages, including Greek, Russian, and other exotic (aka non-Latin-based) alphabets. The minor change in the second loop can be done in UTF-16 too; there is nothing preventing us from using the same logic in UTF-16 as in your UTF-8 example, and you are API-native as a plus.
I don't see anything in your examples that should make us choose UTF-8 or UTF-16, other than the fact that you encounter the multi-code-point situation a lot sooner in UTF-8.

Good judgement is the result of experience … Experience is the result of bad judgement.

OS : Windows 7 64 bit
Laz: Lazarus 1.4.4 FPC 2.6.4 i386-win32-win32/win64

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4565
  • I like bugs.
Re: Need help understanding the effects of Unicode
« Reply #11 on: June 08, 2015, 12:07:13 am »
Quote
Mind you, processing UTF-8 is slower than UTF-16. A UnicodeString has its length stored in a field of the string record, while for UTF-8 the character count has to be computed each time.

That is false information. It assumes that UTF-16 has fixed-width code points. It does not. A code point can consist of one or two 16-bit code units, called UnicodeChar in Delphi. Treating them right requires logic similar to what UTF-8 needs.
Unfortunately, you are not the only person with this misconception. There is a lot of sloppy UTF-16 code that treats everything as fixed width.
Such code is broken with many special symbols, for example the wine glass glyph:
  http://www.fileformat.info/info/unicode/char/1f377/index.htm
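To illustrate (a sketch; #$D83C#$DF77 is the UTF-16 surrogate pair encoding U+1F377):

Code: [Select]
program WineGlass;
{$mode objfpc}{$H+}
var
  W: UnicodeString;
begin
  W := #$D83C#$DF77;   // U+1F377 WINE GLASS: one code point, two UTF-16 code units
  Writeln(Length(W));  // 2: Length counts 16-bit code units, not code points
end.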

Quote
Compare that to Qt, which is fully UTF-16. I also didn't know that the LCL converts strings on each Windows API call. That must make it painfully slow.

The LCL uses UTF-8 and is not painfully slow. Your information is false again.
My feeling is that code dealing with UTF-8 can even be faster, because often you don't need UTF-8-specific functions, due to special properties of UTF-8.
See details here:
  http://wiki.freepascal.org/UTF8_strings_and_characters
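For illustration, a sketch of one such property (the string values are just examples): a whole-substring search works byte-wise, because a valid UTF-8 sequence never matches in the middle of another codepoint, and the byte index that Pos returns is valid for the byte-oriented Copy:

Code: [Select]
program SubstringSearch;
{$mode objfpc}{$H+}
var
  S: String;
  BytePos: Integer;
begin
  S := 'crème brûlée';
  BytePos := Pos('brûlée', S);                  // plain byte-wise search is safe here
  Writeln(Copy(S, BytePos, Length('brûlée')));  // prints: brûlée
end.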

At least in Delphi, all string functions became a lot slower after the switch to UTF-16, even the functions dealing only with ASCII-range characters.

For anybody who now ports ASCII code to UTF-8, I recommend the improved support in Lazarus:
  http://wiki.freepascal.org/Better_Unicode_Support_in_Lazarus
Currently you need development versions of FPC and Lazarus for it, but it is worth it.

P.S.
Why does the misconception about fixed-width UTF-16 characters come up again and again? The facts are well explained in the Unicode documentation.
One reason may be that Delphi and other tools claim that old ASCII-string code continues to work as-is with the new Unicode text, which is not true. It is a nasty marketing trick.
« Last Edit: June 08, 2015, 12:20:21 am by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

rtusrghsdfhsfdhsdfhsfdhs

  • Full Member
  • ***
  • Posts: 162
Re: Need help understanding the effects of Unicode
« Reply #12 on: June 08, 2015, 09:46:43 am »
Quote
At least in Delphi, all string functions became a lot slower after the switch to UTF-16, even the functions dealing only with ASCII-range characters.

They became slower because a lot of code was ported from assembler to pure Pascal.

UTF-8 was made with storage in mind, not processing. UTF-8 processing requires complex state machines, and indexing it by character position makes naive loops O(N^2).

Quote
My feeling is that code dealing with UTF-8 can even be faster, because often you don't need UTF-8-specific functions, due to special properties of UTF-8.
See details here:
  http://wiki.freepascal.org/UTF8_strings_and_characters

Sure, if you process only English text, then I guess yes.

Quote
That is false information. It assumes that UTF-16 has fixed-width code points. It does not. A code point can consist of one or two 16-bit code units, called UnicodeChar in Delphi. Treating them right requires logic similar to what UTF-8 needs.
Unfortunately, you are not the only person with this misconception. There is a lot of sloppy UTF-16 code that treats everything as fixed width.
Such code is broken with many special symbols, for example the wine glass glyph:
  http://www.fileformat.info/info/unicode/char/1f377/index.htm

Other than that symbol and a couple of super-rare ones, it is fixed width. Compare that to UTF-8, which stops being fixed width as soon as you add a bit of Cyrillic or Chinese.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10900
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #13 on: June 08, 2015, 11:19:43 am »
Quote
The difference is that in UTF-16, code points and characters are the same thing in 99% of languages
Quote
other than the fact that you encounter the multi-code-point situation a lot sooner in UTF-8.
Minor correction: afaik that is true equally for UTF-8 and UTF-16. They have the same codepoints. The difference is that in UTF-8 a codepoint takes a variable number of bytes.
So the likelihood of a multi-codepoint character is the same for both.

Also, the existence of pre-composed chars (e.g. for accented letters) does not guarantee their use. Unless you know the source of your text, you may well encounter decomposed chars (2 codepoints) in many European languages (incl. French, German, and others).

Quote
I don't see anything in your examples that should make us choose UTF-8 or UTF-16
I didn't make an argument for UTF-8; I simply corrected a point against it. The speed argument is exaggerated in many cases. That said, there may be cases where it applies. There are also cases where UTF-8 is faster (because pure English text needs less memory in UTF-8, which may reduce cache misses).

Quote
you are API-native as a plus
Depends on your OS. Afaik the Windows API is UTF-16 and Linux is UTF-8, but I would have to double-check that.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10900
  • Debugger - SynEdit - and more
    • wiki
Re: Need help understanding the effects of Unicode
« Reply #14 on: June 08, 2015, 11:31:44 am »
Quote
That is false information. It assumes that UTF-16 has fixed-width code points.
Quote
Such code is broken with many special symbols, for example the wine glass glyph:
  http://www.fileformat.info/info/unicode/char/1f377/index.htm

Actually, in UTF-16 the code units do have a fixed width (1 word = 2 bytes). But characters are still of variable length. And that can (as in: optionally) apply to much more common examples such as accented chars, umlauts, and others.

About the wine glass: this is a surrogate pair, so technically it is two code units (even though afaik neither of them can stand alone in valid UTF-16).


 
