Author Topic: A simple sane question to end insanity! TStringList in unicode mode  (Read 24005 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #45 on: August 18, 2017, 08:42:00 am »
It is ASCII at least for POSIX locale. Search for ASCII here.
Yes, for POSIX. ASCII; for Unicode it is ANSI, based on the codepages I mentioned for 128-255.
See also the Unicode specifications.
Where did you get that from?
Quote
Don't remember. I'll see if I can find it. But it is easy to check your locale, either do the comparison using your system API, like CompareString on Windows, or better get the weights used in the final binary comparison.
I did... hence the Mantis entry.
Quote
Again, Windows CompareString, and wcstring's strcoll both are locale dependent. The locales on your systems are reversing the order of capital/small letters. Test my statement using CompareString/strcoll.
Windows ComparestrW or CompareStringEx should be used: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317759(v=vs.85).aspx
Both sort as expected!
Specialize a type, not a var.

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #46 on: August 18, 2017, 08:49:37 am »
Therefore, it would seem that under C# a unicode comparison of 'a' and 'B' returns 'a' < 'B'.

Thank you!

 8-) <sigh> C# does not suffer this bug. EganSolo reached the opposite conclusion for the information given:
Quote
Quote
A string is a sequential collection of Unicode characters that is used to represent text. The value of the String object is the content of the sequential collection of characters

Therefore, it would seem that under C# a unicode comparison of 'a' and 'B' returns 'a' < 'B'.
The latter means on most Latin codepages C# treats 'a' > 'B' because 'B' has a lower sequential index in the ASCII-compatible part of the Unicode tables.
In WinAPI the result is correct
In C# the result is correct
In Java the result is correct.
In Delphi the result is correct
In Freepascal it fails


Note compare functions can be both case sensitive and case insensitive. That should be obeyed. For TStringList to work as intended and transparently regardless of a string manager it should use e.g. LOCALE_INVARIANT on Windows. That's the whole point...
A TStringList is not a display format. And a widestring manager should not affect AnsiString if the default string type is Ansi.

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/Classes_TStringList_Sort.html

I am fully aware of:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_AnsiCompareStr.html
Which may have led to this misunderstanding. The point is: something should be done, either in the docs or in the code.
« Last Edit: August 18, 2017, 09:06:20 am by Thaddy »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #47 on: August 18, 2017, 10:39:35 am »
By the way, as I wade my way through this unicode stuff, I'll try to write a beginner's summary to what I'm learning here. The links you provided me are making more and more sense but my impression is that they are not written for beginners  :).
The page I linked only explains how to use Unicode in Lazarus. One has to know something about Unicode before reading it.
The easy summary part for beginners is in the "Usage" section.
  http://wiki.freepascal.org/Unicode_Support_in_Lazarus#Usage
It is enough to know for many (most?) users.

Are you planning to write a beginner's summary of Unicode in general? Good luck!
The problem is that Unicode is complex. To make it look easy you must leave out lots of information.
The Internet is full of Unicode-related pages. You could also look for the best ones and maybe link them in the "See Also" section of our wiki page. Now it has only one link, for FPC's Unicode support.

Quote
I'll upgrade to 1.8, go through the LazUnicode unit and I'll post back here some of my own observations.
See also the test project in  components/lazutils/test/LazUnicodeTest.lpi  to get an idea how to use it.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #48 on: August 18, 2017, 10:53:58 am »
Quote
Are you planning to write a beginner's summary of Unicode in general? Good luck!

No. Not really. Indeed, I have read several pages on unicode and several dealing with utf-8. I don't plan on delving into these details. I think what is required is something that addresses the questions I've had to grapple with as indicated below. I would love to read what you think about these and how best to address them? Again, it may be because I haven't yet fully understood how to leverage the LazUnicode unit?

Quote
A typical error when you don't know enough about Unicode. :(
You treat UTF-16 as a fixed width encoding while it actually is variable width.
The code is wrong for about half of the defined codepoints.
You may want to look at the unit LazUnicode in package LazUtils. It lets you make robust and encoding agnostic code which supports also Delphi.

So then, I upgraded to 1.8 and went through LazUnicode as well as LazUTF8 and LazUTF8SysUtils. I think I get most of it, but there are still some points of confusion that perhaps this link http://wiki.lazarus.freepascal.org/Unicode_Support_in_Lazarus should address. Note that these may well be due to the novelty of Unicode for me, but I found them confusing and chances are other coders might also.

  • When including LazUTF8 as my first unit (See second program below), StringOfChar does not map to UTF8StringOfChar. Why? In general, what is the list of string functions that won't map versus the list that would? Is it only the list found in LazUnicode? Why only this list?
  • Since Ord is a function returning a LongInt, why can't we have it work as required under utf-8 by returning the hex value across all bytes representing a unicode character? I don't get it.
  • Iterating through a string using an integer index does not work: This is perhaps the hardest one to deal with. We're so used to writing For i := 1 to length(S), and it should be clearly mentioned, unless of course I've done something wrong in the next program:

Code: Pascal
program TryUnicode_IntForLoopBroken;
uses LazUtf8, LazUnicode;
Var S: String;
    i: integer;
begin
  S := 'Éternité et sérénité:äâôöëêîï';
  Writeln('S = ', S);
  // Length(S) counts bytes (UTF-8 code units), so S[i] is a single byte
  For i := 1 to Length(S) - 1 do
     Write(S[i], ' -- ');
  Writeln(S[length(S)]);
  Readln;
end.

Now, compare that code to this one:
Code: Pascal
program TryUtf16_loop;
{$ModeSwitch unicodestrings}
{$Codepage utf-8}
Var S  : String;
    Ch : Char  ;
    i: integer;

begin
  S := 'éternité.â,ä';
  Writeln(S);
  // With unicodestrings, Length(S) counts 16-bit UTF-16 code units
  For i := 1 to length(S) - 1 do
  begin
   Ch := S[i];
   Write(ch, '--');
  end;
  Writeln(S[length(s)]);
  Readln;
end.

Admittedly, this code is using utf-16 but it (seems) to work? Is it possible that for certain planes this code would break? So, I'm not certain now of the benefits of using utf-8 based unicode versus utf-16, except I guess when I need to interface with the user interface components, which are utf-8 based?
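To make the question about planes concrete, here is a minimal sketch that uses only the RTL. It builds a string containing U+1F600, a code point outside the BMP, from its two surrogate code units (written as numbers so the source stays ASCII-only) and shows that Length and indexing work on code units. The program name is made up for this illustration.

```pascal
program SurrogateLength;
{$mode objfpc}{$H+}

var
  S: UnicodeString;
  i: Integer;
begin
  // 'a', then U+1F600 as the surrogate pair D83D DE00, then 'b'
  S := 'a';
  S := S + WideChar($D83D);
  S := S + WideChar($DE00);
  S := S + 'b';

  // Length counts 16-bit code units, not code points: 1 + 2 + 1 = 4
  WriteLn('Code units: ', Length(S));

  // Indexing visits each half of the surrogate pair separately
  for i := 1 to Length(S) do
    WriteLn(i, ': $', HexStr(Ord(S[i]), 4));
end.
```

Three code points occupy four code units here, so an integer loop that treats S[i] as a character sees two halves that mean nothing on their own.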

Lastly, in trying to understand how to work with utf-8 unicode strings, I rewrote my own CompareStr for educational purposes. I would love to know where this one breaks. Notice that if we could get the integer iterator to work as expected, this educational function would be a lot easier to understand:
Code: Pascal
program TryUnicode_With_UTF8;
uses LazUtf8, LazUnicode, LazUTF8SysUtils, SysUtils;

Function MyCompareStr_Educational(const S1, S2 : String) : Integer;
var Len1  , Len2: Integer;
    E1    , E2  : TCodePointEnumerator;
    i           : integer ;
begin
  Len1 := Length(S1);
  Len2 := Length(S2);
  If Len1 = 0
  then If Len2 = 0
       then Result := 0
       else Result := 1
  else If Len2 = 0
       then Result := -1
       else begin
         Result := 0;
         E1 := TCodePointEnumerator.Create(S1);
         E2 := TCodePointEnumerator.Create(S2);
         While (E1.MoveNext and E2.MoveNext) do
         begin
           Len1 := Length(E1.Current);
           Len2 := Length(E2.Current);
           // A UTF-8 code point encoded in fewer bytes has a smaller ordinal
           // value than one encoded in more bytes.
           Result := Len1 - Len2;
           If Result = 0 //same length, more is required.
           then For i := Len1 downto 1 do
                begin
                  Result := Ord(E1.Current[i]) - Ord(E2.Current[i]);
                  If Result <> 0 then break;
                end;
         end;
         E1.Free;
         E2.Free;
       end;
end;

var S1, S2 : String;
     Res    : integer;
begin
  //Why isn't StringOfChar mapped to UTF8StringOfChar? Why must I call it explicitly?
  S1  := UTF8StringOfChar('ö',10) + String('a');
  S2  := UTF8StringOfChar('ö',10) + String('B');
  Res := MyCompareStr_Educational(S1,S2);
  Write('S1 = ', S1 , ' and S2 = ', S2, ' and S1 ');
  If Res < 0
  then Writeln(' < S2')
  else Writeln(' > S2');
  readln;
end.
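For comparison, here is a hedged sketch of an ordinal comparison that needs no enumerator. It relies on a property of UTF-8: comparing the bytes from left to right gives the same order as comparing code point values. Unlike the loop above, it stops at the first difference, compares from the first byte rather than the last, and treats a shorter prefix as smaller. CompareUtf8Bytes is a hypothetical helper, not part of LazUTF8 or the RTL, and the test strings are written as byte escapes so the result does not depend on the source file encoding.

```pascal
program CompareUtf8Sketch;
{$mode objfpc}{$H+}

// Hypothetical helper: ordinal comparison of two UTF-8 strings.
// Byte order in UTF-8 matches code point order, so no decoding is needed.
function CompareUtf8Bytes(const S1, S2: AnsiString): Integer;
var
  i, MinLen: Integer;
begin
  MinLen := Length(S1);
  if Length(S2) < MinLen then
    MinLen := Length(S2);
  for i := 1 to MinLen do
  begin
    Result := Ord(S1[i]) - Ord(S2[i]);
    if Result <> 0 then
      Exit;                        // the first differing byte decides
  end;
  // One string is a prefix of the other: the shorter one sorts first
  Result := Length(S1) - Length(S2);
end;

var
  Prefix: AnsiString;
begin
  Prefix := #$C3#$B6#$C3#$B6;      // U+00F6 twice, as UTF-8 bytes
  WriteLn(CompareUtf8Bytes(Prefix + 'a', Prefix + 'B'));  // 31: 'a' (97) > 'B' (66)
  WriteLn(CompareUtf8Bytes('', 'x'));                     // -1: empty sorts first
  WriteLn(CompareUtf8Bytes('z', #$C3#$B6));               // -73: ASCII sorts before U+00F6
end.
```

This is ordinal order only. It says nothing about locale-aware collation, which is the other half of this thread.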

And here is the equivalent program written with utf-16. Frankly, I'm not certain which is better, assuming any of it works :)

Code: Pascal
program TryUnicode_With_UTF16;
{$ModeSwitch unicodestrings}
{$Codepage utf-8}

Function MyCompareStr_Educational(const S1, S2 : String) : Integer;
Type
  MyRec = Record
    case byte of
     0 : (ch : Char);
     1 : (b1, b2, b3, b4 : byte);
  end;

var Len1  , Len2: Integer;
    Len         : integer;
    i           : integer;
    ch1   , ch2 : MyRec  ;
begin
  Len1 := Length(S1);
  Len2 := Length(S2);
  If Len1 = 0
  then If Len2 = 0
       then Result := 0
       else Result := 1
  else If Len2 = 0
       then Result := -1
       else begin
         Result := 0;
         Len := Len1;
         If Len > Len2
         then Len := Len2;
         For i := 1 to Len do
         begin
           ch1.ch := S1[i];
           Ch2.ch := S2[i];
           Result := ch1.b4 - ch2.b4;
           If Result <> 0 then exit;
           Result := ch1.b3 - ch2.b3;
           If Result <> 0 then exit;
           Result := ch1.b2 - ch2.b2;
           If Result <> 0 then exit;
           Result := ch1.b1 - ch2.b1;
           If Result <> 0 then exit;
         end;
       end;
end;

function UStringOfChar(ch : Char; Len : SizeInt) : String;
var i  : integer;
    P  : pointer;
    S  : integer;
begin
 S := SizeOf(Char);
 SetLength(Result, Len);
 p := Pointer(Result);
 For i := 1 to Len do begin
  Move(ch,p^,S);
  System.Inc(p,S);
 end;
end;

var
    S1, S2 : String;
    Res    : integer;
begin
  S1  := UStringOfChar('ö', 10) + String('a');
  S2  := UStringOfChar('ö', 10) + String('B');
  Res := MyCompareStr_Educational(S1,S2);
  Write('S1 = ', S1 , ' and S2 = ', S2, ' and S1 ');
  If Res < 0
  then Writeln(' < S2')
  else Writeln(' > S2');
  readln;
end.

Thus far, I find my experience with Unicode (utf-8 / utf-16) to be a bit unsettling. I'm not sure which function I could use in a transparent way via LazUnicode versus which one I must invoke with the utf8 prefix. I do get that the string based iterator functions like an integer iterator but the habit is set and it is hard to move away from it. On the flip side, utf-16 seems to be a bit more streamlined but then again I had to code my own StringOfChar, which leaves me wondering how many of these functions I would have to code versus how many are actually available. Also, I'm not certain that the code that I wrote does handle planes requiring four bytes instead of two to represent certain characters.
« Last Edit: August 18, 2017, 12:06:49 pm by EganSolo »

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #49 on: August 18, 2017, 12:47:55 pm »
As long as your strings adhere to the code plane of the original UTF16 (which is about equal to the older fixed length UCS-2) that first example does indeed work. In 1992/3 people decided that 65536 characters were not enough after all, and so UTF-16 became a variable-width encoding instead of the originally intended fixed-width one.
This matters less for Latin languages than for exotic scripts, some really specific kinds of scientific notation, and emoji.

After the 1992 change all hell broke loose ... :D You can also take a look at Python, which made the decent choice and defaults to UTF32 in most internal representations and simply converts at display time. UTF32 is fixed length and branchless.
UTF16 isn't anymore, but most code can assume it's fixed length unless you need to handle post-1992 extensions to the original specification.
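The variable-width part of UTF-16 comes down to a small piece of arithmetic. The sketch below shows the standard surrogate-pair mapping for code points above U+FFFF. The helper names ToSurrogates and FromSurrogates are made up for this illustration and do not come from the RTL or LazUtils.

```pascal
program SurrogateMath;
{$mode objfpc}{$H+}

// Hypothetical helper: splits a code point in the range $10000..$10FFFF
// into its high and low surrogate code units.
procedure ToSurrogates(CodePoint: LongInt; out HighSur, LowSur: Word);
var
  V: LongInt;
begin
  V := CodePoint - $10000;                // 20 significant bits remain
  HighSur := Word($D800 or (V shr 10));   // top 10 bits
  LowSur  := Word($DC00 or (V and $3FF)); // bottom 10 bits
end;

// Hypothetical helper: the inverse mapping.
function FromSurrogates(HighSur, LowSur: Word): LongInt;
begin
  Result := $10000 + ((HighSur - $D800) shl 10) + (LowSur - $DC00);
end;

var
  H, L: Word;
begin
  ToSurrogates($1F600, H, L);
  WriteLn(HexStr(H, 4), ' ', HexStr(L, 4));   // D83D DE00
  WriteLn(HexStr(FromSurrogates(H, L), 5));   // 1F600
end.
```

Code that only ever sees BMP text never produces values in the D800..DFFF range, which is why fixed-width assumptions appear to work until such a character shows up.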
« Last Edit: August 18, 2017, 12:55:41 pm by Thaddy »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #50 on: August 18, 2017, 12:57:33 pm »
It is ASCII at least for POSIX locale. Search for ASCII here.
Yes, for POSIX. ASCII; for Unicode it is ANSI, based on the codepages I mentioned for 128-255.
See also the Unicode specifications.
Where did you get that from?
Quote
Don't remember. I'll see if I can find it. But it is easy to check your locale, either do the comparison using your system API, like CompareString on Windows, or better get the weights used in the final binary comparison.
I did... hence the Mantis entry.
Quote
Again, Windows CompareString, and wcstring's strcoll both are locale dependent. The locales on your systems are reversing the order of capital/small letters. Test my statement using CompareString/strcoll.
Windows ComparestrW or CompareStringEx should be used: https://msdn.microsoft.com/en-us/library/windows/desktop/dd317759(v=vs.85).aspx
Both sort as expected!


CompareStrW does not take the locale into consideration. This implies that you do believe that the unexpected results are due to the locale, or you tested the wrong function, which explains your confusion, am I right? Or did you mean CompareStringW, which according to my simple test:
Code: Pascal
// Needs the Windows unit for CompareStringW and LOCALE_SYSTEM_DEFAULT
Procedure Test_CompareStringW(s1,s2:String);
const
  // CompareStringW returns 0 on failure, otherwise 1, 2 or 3
  AN:Array[0..3] of String =('Failed','CSTR_LESS_THAN','CSTR_EQUAL','CSTR_GREATER_THAN');
var
  us1,us2: UnicodeString;
begin
  us1 := s1;
  us2 := s2;

  WriteLn(s1,' ',AN[CompareStringW(LOCALE_SYSTEM_DEFAULT, 0,@us1[1],Length(us1),@us2[1],Length(us2))],' ',s2);
end;

Testing:
Code: Pascal
  Test_CompareStringW('100','200');
  Test_CompareStringW('200','100');
  Test_CompareStringW('100','100');
  Test_CompareStringW('a','B');
  Test_CompareStringW('B','a');
  Test_CompareStringW('a','a');

Result:
Quote
100 CSTR_LESS_THAN 200
200 CSTR_GREATER_THAN 100
100 CSTR_EQUAL 100
a CSTR_LESS_THAN B
B CSTR_GREATER_THAN a
a CSTR_EQUAL a

Meaning the order is:
Quote
100,200,a,B

lowercase before uppercase, unlike ANSI/ASCII.
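For contrast with the locale-aware result above, this short sketch shows what the ordinal and the case-insensitive RTL comparisons return for the same pair. It uses only SysUtils. The AnsiCompareStr line is printed without an expected value because, as discussed in this thread, its result depends on the platform and on whether a widestring manager is installed.

```pascal
program OrdinalVsLocale;
{$mode objfpc}{$H+}
uses
  SysUtils;

begin
  // CompareStr is ordinal: 'a' is 97 and 'B' is 66, so 'a' sorts after 'B'
  WriteLn(CompareStr('a', 'B') > 0);     // TRUE

  // CompareText ignores ASCII case, so this compares like 'A' with 'B'
  WriteLn(CompareText('a', 'B') < 0);    // TRUE

  // Locale and string manager dependent; the sign may differ between systems
  WriteLn(AnsiCompareStr('a', 'B'));
end.
```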

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #51 on: August 18, 2017, 01:04:30 pm »
lowercase before uppercase, unlike ANSI/ASCII.
And that is the bug... I am fully aware of the technical details, but installing a widestring manager should not cause sort-order changes to AnsiString. The rest is overcomplicating things.

The widestring or unicodestring manager should not affect the AnsiString internals in any way. They are different beasts... We do not even have a TStringlist for unicode in classes yet...
« Last Edit: August 18, 2017, 01:07:15 pm by Thaddy »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #52 on: August 18, 2017, 01:05:30 pm »
Therefore, it would seem that under C# a unicode comparison of 'a' and 'B' returns 'a' < 'B'.

Thank you!

 8-) <sigh> C# does not suffer this bug. EganSolo reached the opposite conclusion for the information given:

Again, his result: 'a' and 'B' returns 'a' < 'B'
lowercase a before uppercase B, unlike the expected ANSI/ASCII order.

Quote
Quote
A string is a sequential collection of Unicode characters that is used to represent text. The value of the String object is the content of the sequential collection of characters

Therefore, it would seem that under C# a unicode comparison of 'a' and 'B' returns 'a' < 'B'.
The latter means on most Latin codepages C# treats 'a' > 'B' because 'B' has a lower sequential index in the ASCII-compatible part of the Unicode tables.

You reversed the direction. Intentional or mistake?

In WinAPI the result is correct
In C# the result is correct
In Java the result is correct.
In Delphi the result is correct
In Freepascal it fails


Note compare functions can be both case sensitive and case insensitive. That should be obeyed. For TStringlist to work as intended and transparently regardless of a string manager it should use e.g. LOCALE_INVARIANT on Windows. That's the whole point...

LOCALE_INVARIANT gave me the same results as LOCALE_SYSTEM_DEFAULT, lowercase before uppercase.

A TStringList is not a display format. And a widestring manager should not affect AnsiString if the default string type is Ansi.

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/Classes_TStringList_Sort.html

I am fully aware of:
http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/SysUtils_AnsiCompareStr.html
Which may have led to this misunderstanding. The point is: something should be done, either in the docs or in the code.

Thank you for finding the source. Quoting it here:
Quote
Note: Most locales consider lowercase characters to be less than the corresponding uppercase characters. This is in contrast to ASCII order, in which lowercase characters are greater than uppercase characters. Thus, setting S1 to 'a' and S2 to 'A' causes AnsiCompareStr to return a value less than zero, while CompareStr, with the same arguments, returns a value greater than zero.
« Last Edit: August 18, 2017, 01:12:39 pm by engkin »

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #53 on: August 18, 2017, 01:09:41 pm »
You still miss the point: the widestring manager should not affect AnsiString. AnsiString behavior is fixed, not variable.... <sigh> It breaks code... that's enough to file the bug.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #54 on: August 18, 2017, 01:23:51 pm »
You still miss the point: the widestring manager should not affect AnsiString. AnsiString behavior is fixed, not variable.... <sigh> It breaks code... that's enough to file the bug.

Breaks code, or unexpected sort order due to locale?

BeniBela

  • Hero Member
  • *****
  • Posts: 905
    • homepage
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #55 on: August 18, 2017, 01:48:20 pm »
That reminds me of the time I was using a sorted TStringList as a map and then it suddenly stopped working.

Because I was creating the TStringList in the initialization section and some other unit was installing a wide string manager in their initialization section, but after I had created the list.

Then AnsiCompareStr was a different function during list creation than during list usage, and the formerly sorted string list was unsorted according to the new compare function.
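The failure mode described here can be reproduced without TStringList. The sketch below sorts four items in ordinal order and then runs a binary search with a different comparison, which misses an item that is present. LocaleLikeCompare is only a stand-in assumption for "lowercase before uppercase" ordering, not a real collation, and all names are made up for this illustration.

```pascal
program ComparatorSwitch;
{$mode objfpc}{$H+}
uses
  SysUtils;

type
  TCompareFunc = function(const A, B: AnsiString): Integer;

// Ordinal order: 'A' < 'B' < 'a' < 'b'
function OrdinalCompare(const A, B: AnsiString): Integer;
begin
  Result := CompareStr(A, B);
end;

// Stand-in for a locale-style order: 'a' < 'A' < 'b' < 'B'
function LocaleLikeCompare(const A, B: AnsiString): Integer;
begin
  Result := CompareText(A, B);
  if Result = 0 then
    Result := CompareStr(B, A);   // reversed tie-break puts lowercase first
end;

// Plain binary search; only valid if Items is sorted by the same Cmp
function BinarySearch(const Items: array of AnsiString;
  const Key: AnsiString; Cmp: TCompareFunc): Integer;
var
  L, H, M, C: Integer;
begin
  Result := -1;
  L := 0;
  H := High(Items);
  while L <= H do
  begin
    M := (L + H) div 2;
    C := Cmp(Items[M], Key);
    if C = 0 then
      Exit(M);
    if C < 0 then
      L := M + 1
    else
      H := M - 1;
  end;
end;

const
  Sorted: array[0..3] of AnsiString = ('A', 'B', 'a', 'b'); // ordinal order

begin
  WriteLn(BinarySearch(Sorted, 'b', @OrdinalCompare));     // 3
  WriteLn(BinarySearch(Sorted, 'b', @LocaleLikeCompare));  // -1: present but not found
end.
```

A sorted list is only searchable with the comparison it was sorted by, so swapping the comparison after the list is built silently breaks lookups.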

You still miss the point: the widestring manager should not affect AnsiString. AnsiString behavior is fixed, not variable....

But the widestring manager changes everything. It always has.

That is why I do not use the Ansi* functions and write all Unicode functions I need myself.


EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #56 on: August 18, 2017, 10:23:51 pm »
That is why I do not use the Ansi* functions and write all Unicode functions I need myself.

BeniBela, that was also the starting point of my woes. I'm writing an editor using TATSynEdit and decided to rely on my parser for syntax highlighting. TATSynEdit is unicode-based while my parser wasn't. I began receiving an avalanche of notes and warnings about conversion from Ascii to Unicode. Based on where I am in my code, these messages weren't critical but they were hiding other warnings that were. So I determined to clean-up the mess by converting my packages over to unicode. I was still on 1.6.4 but had stumbled upon a wiki-page or an answer in the forum that said to include {$modeswitch Unicodestrings}. Once I did that, TStringList began complaining about the mix-up between unicode and ascii strings. So I wrote my own unicode-compliant (or so I thought) replacement for TStringList. But then the test batteries I wrote for it broke when I tried to sort, which brought me here.

I wrote this summary to highlight that I was unaware of the pitfalls I was getting into when  I switched to unicode. I'm concerned that the learning curve is greater than it should be. The wiki page that JuhaManninen refers to is a good starting point but it could be improved: I added a discussion here: http://wiki.freepascal.org/Talk:Unicode_Support_in_Lazarus to this http://wiki.freepascal.org/Unicode_Support_in_Lazarus as a suggestion.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #57 on: August 19, 2017, 03:09:12 am »
When including LazUTF8 as my first unit (See second program below), StringOfChar does not map to UTF8StringOfChar. Why? In general, what is the list of string functions that won't map versus the list that would? Is it only the list found in LazUnicode? Why only this list?
Which way does it not map? Do you mean LazUnicode has no similar function? I can add StringOfCodePoint() there. This should work, although is not optimized:
Code: Pascal
function StringOfCodePoint(ACodePoint: String; N: Integer): String;
// Like StringOfChar
var
  i: Integer;
begin
  Result := '';
  for i := 1 to N do
    Result := Result + ACodePoint;
end;

Quote
Since Ord is a function returning a LongInt, why can't we have it work as required under utf-8 by returning the hex value across all bytes representing a unicode character? I don't get it.
You must keep a variable length codepoint in a String. Ord does not work with strings.
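If a numeric value is what is wanted, the code point has to be decoded from its bytes. The following is a hand-written sketch of that decoding for a single UTF-8 sequence, using only the RTL. Utf8CodePointOrd is a hypothetical name. The function assumes well-formed input and does no validation beyond the lead byte and the available length.

```pascal
program Utf8Ord;
{$mode objfpc}{$H+}

// Hypothetical helper: numeric value of the first code point in S.
// Returns -1 for an empty string, an unexpected lead byte or truncated input.
function Utf8CodePointOrd(const S: AnsiString): LongInt;
var
  B: Byte;
  Count, i: Integer;
begin
  Result := -1;
  if S = '' then
    Exit;
  B := Ord(S[1]);
  if B < $80 then
    Exit(B);                                   // plain ASCII
  if (B and $E0) = $C0 then Count := 2
  else if (B and $F0) = $E0 then Count := 3
  else if (B and $F8) = $F0 then Count := 4
  else Exit;                                   // not a lead byte
  if Length(S) < Count then
    Exit;                                      // truncated sequence
  // Payload bits of the lead byte: 5, 4 or 3 bits for 2, 3 or 4 bytes
  Result := B and ($FF shr (Count + 1));
  // Each continuation byte contributes its low 6 bits
  for i := 2 to Count do
    Result := (Result shl 6) or (Ord(S[i]) and $3F);
end;

begin
  WriteLn(Utf8CodePointOrd('A'));                // 65
  WriteLn(Utf8CodePointOrd(#$C3#$B6));           // 246    (U+00F6)
  WriteLn(Utf8CodePointOrd(#$E2#$82#$AC));       // 8364   (U+20AC)
  WriteLn(Utf8CodePointOrd(#$F0#$9F#$98#$80));   // 128512 (U+1F600)
end.
```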

Quote
Iterating through a string using an integer index does not work: This is perhaps the hardest one to deal with: We're so used to write For i := 1 to length(S) and it should be clearly mentioned, unless of course, I've done something wrong in the next program
It does work! You are then iterating codeunits, not codepoints. In many cases the codeunit resolution is useful also with variable length encodings.
Did you look at this page?
 http://wiki.freepascal.org/UTF8_strings_and_characters
It is linked from the other page. It could be rewritten to use LazUnicode instead of LazUTF8.

In your case you must iterate codepoints. Using LazUnicode :
Code: Pascal
  for ch in s do
    Do_your_thing_with(ch);
Note, it does not work right with decomposed accent marks. For that you must use TUnicodeCharacterEnumerator.
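The same walk can be written with an integer index and a while loop by advancing one code point at a time instead of one byte at a time. The sketch below does this by hand with the RTL only, so it stays self-contained. LazUTF8 ships its own routines for this, and Utf8LeadLen here is a hypothetical stand-in rather than one of them. Like the enumerator, it does not group decomposed accent marks with their base character.

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

// Hypothetical helper: byte length of the code point starting with lead byte B.
// Returns 1 for ASCII and for unexpected bytes so the loop always advances.
function Utf8LeadLen(B: Byte): Integer;
begin
  if (B and $E0) = $C0 then Result := 2
  else if (B and $F0) = $E0 then Result := 3
  else if (B and $F8) = $F0 then Result := 4
  else Result := 1;
end;

// Counts code points by stepping from lead byte to lead byte
function CountCodePoints(const S: AnsiString): Integer;
var
  i: Integer;
begin
  Result := 0;
  i := 1;
  while i <= Length(S) do
  begin
    Inc(i, Utf8LeadLen(Ord(S[i])));
    Inc(Result);
  end;
end;

var
  S, CodePoint: AnsiString;
  i, n: Integer;
begin
  S := 'a'#$C3#$B6'b';             // 'a', U+00F6, 'b' as UTF-8 bytes
  i := 1;
  while i <= Length(S) do
  begin
    n := Utf8LeadLen(Ord(S[i]));
    CodePoint := Copy(S, i, n);    // one whole code point as a string
    WriteLn('offset ', i, ': ', Length(CodePoint), ' byte(s)');
    Inc(i, n);
  end;
  // 4 code units but only 3 code points
  WriteLn(Length(S), ' code units, ', CountCodePoints(S), ' code points');
end.
```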

Quote
Admittedly, this code is using utf-16 but it (seems) to work? Is it possible that for certain planes this code would break?
Yes obviously. It only works with BMP. UTF-16 is a variable width encoding just like UTF-8.

Quote
So, I'm not certain now of the benefits of using utf-8 based unicode versus utf-16 ...
There is at least one big benefit: You must code right because the multi-byte codepoints are so common. Then as an extra bonus it supports all codepoints without exceptions.
UTF-16 is a good encoding when used right but unfortunately it is often not used right.
Wrong, buggy usage of UTF-16 is promoted around the Internet. It is promoted even in this forum. :(
Even this forum software itself is buggy and does not support Unicode. I wanted to copy an example string outside BMP for you but I cannot. Just try it yourself in your next post.
Even commercial SW that is advertised as "Unicode aware" often is not. Maybe you start to understand how bad the situation is.

Quote
Lastly, in trying to understand how to work with utf-8 unicode strings, I rewrote my own CompareStr for educational purposes. I would love to know where this one breaks. Notice that if we could get the integer iterator to work as expected, this educational function would be a lot easier to understand:
Nice educational CompareStr. I have used AnsiCompareStr myself.
Integer iterator works as expected. What is the problem?

Quote
Thus far, I find my experience with Unicode (utf-8 / utf-16) to be a bit unsettling. I'm not sure which function I could use in a transparent way via LazUnicode versus which one I must invoke with the utf8 prefix.
We can add more functions to LazUnicode. What is missing?
BTW, UTF-8 and UTF-16 are only encodings for codepoints. Unicode is more complex than that. Getting codepoints right is easy with any encoding, honestly.

Quote
I do get that the string based iterator functions like an integer iterator but the habit is set and it is hard to move away from it.
Do you mean iterating codeunits versus iterating codepoints? They are both useful. See the UTF8_strings_and_characters wiki page.

Quote
On the flip side, utf-16 seems to be a bit more streamlined ...
You mean UCS-2 is more streamlined? UTF-16 is a variable width encoding. UCS-2 is rather obsolete now. Even Windows has supported full Unicode for almost 18 years.
« Last Edit: August 19, 2017, 03:24:41 am by JuhaManninen »

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #58 on: August 19, 2017, 04:32:15 am »
When including LazUTF8 as my first unit (See second program below), StringOfChar does not map to UTF8StringOfChar. Why? In general, what is the list of string functions that won't map versus the list that would? Is it only the list found in LazUnicode? Why only this list?
Which way does it not map? Do you mean LazUnicode has no similar function? I can add StringOfCodePoint() there. This should work, although is not optimized:

I meant that calling StringOfChar does not work the way Length or Copy or Pos work. I would like to continue to use StringOfChar and not have to switch to StringOfCodePoint as you've done below. The same goes for Ord. I get that Ord as it is implemented does not work for Unicode. Let me ask my question this way: Why is it that when I add LazUTF8 to my project, it is unable to map Ord and StringOfChar and every other string-related function to a Unicode equivalent? This would be ideal. I'm certain there are very good reasons why this is not done, it's just not obvious.

Quote
Iterating through a string using an integer index does not work: This is perhaps the hardest one to deal with: We're so used to write For i := 1 to length(S) and it should be clearly mentioned, unless of course, I've done something wrong in the next program
It does work! You are then iterating codeunits, not codepoints. In many cases the codeunit resolution is useful also with variable length encodings.

In your case you must iterate codepoints. Using LazUnicode :
Code: Pascal
  for ch in s do
    Do_your_thing_with(ch);
Note, it does not work right with decomposed accent marks. For that, you must use TUnicodeCharacterEnumerator.

You missed the point. In Delphi, if I remember correctly, I could write For i := 1 to length(s) do S[i] := 'Ç'; The compiler inherently understood Unicode and was able to do the right thing. I could also use while and repeat loops as well. Here, I've got to switch to a string-based iterator, and I don't think there's a transparent way to use that iterator in while and repeat loops. You know all of this and you find it easy because you've been doing it for a while. For most of us this is bewildering :)

Quote
There is at least one big benefit: You must code right because the multi-byte codepoints are so common. Then as an extra bonus it supports all codepoints without exceptions.

Agreed. I do see that now.

Quote
Nice educational CompareStr. I have used AnsiCompareStr myself.
Integer iterator works as expected. What is the problem?

But if the Unicode space is divided into 17 planes, and since a given Unicode character can belong to more than one plane where conceivably its relative ordinal index in that plane is distinct from its absolute Unicode ordinal index, how does my CompareStr continue to work? Is it because the RTL has already set the plane to, say, utf8 behind the scenes, and does this then change the ordinal value of the characters? That's where I'm not clear. In fact, I looked at your implementation of CompareStr: it relies on a CompareStrW, which is clearly a call to the underlying Windows OS. I'm not clear why this was so.

Quote
Quote
Thus far, I find my experience with Unicode (utf-8 / utf-16) to be a bit unsettling. I'm not sure which function I could use in a transparent way via LazUnicode versus which one I must invoke with the utf8 prefix.
We can add more functions to LazUnicode. What is missing?

Well, am I to assume that all the routines in StrUtils work as is or only those where the type is String and not AnsiString? Since Strutils is part of RTL which is Utf-16, is there a requirement to transform from utf8 to utf16 before calling these routines?

Also which of the standard pascal operators work with unicode? =, > , <, <>? Again since these are managed by the compiler, should their argument be utf-16 encoded?

I'm sure to you all of these issues are obvious, but for me, as I begin to piece this puzzle back together, there are a lot of questions that come up. The link you provided below is extremely useful. I saw it listed in the main wiki page but I thought it dealt with the technical details of utf-8.


Quote
Do you mean iterating codeunits versus iterating codepoints? They are both useful. See the UTF8_strings_and_characters wiki page.

Quote
Quote
On the flip side, utf-16 seems to be a bit more streamlined ...
You mean UCS-2 is more streamlined? UTF-16 is a variable width encoding. UCS-2 is rather obsolete now. Even Windows has supported full Unicode for almost 18 years now.

To your point, what's tricky here is that almost every character I'm familiar with (standard Latin, French, Arabic, and Syriac) is represented in the UCS-2 encoding, and for these it would work out of the box, so there's this tendency to believe it would work consistently. But then, if the RTL is UTF-16 based, can I at least assume that UTF-16 works as expected in the RTL?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #59 on: August 19, 2017, 12:07:35 pm »
Quote
I meant that calling StringofChar does not work the way Length or Copy or Pos work.
In a way it does. All those functions continue to work with CodeUnits, which means the Pascal "Char" type, but they are still useful with variable-width encodings.
Think carefully why the function SplitInHalf() in the wiki example works with every valid Unicode string!
It uses the good old Pos() and Copy() instead of Unicode specific CodePointPos() and CodePointCopy().
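Why the code-unit functions stay safe can be sketched as follows. This is only an illustration, not the wiki's SplitInHalf(): SplitAt is a hypothetical helper, and the test strings are spelled as explicit UTF-8 bytes so that the source file encoding does not matter. In valid UTF-8, lead bytes and continuation bytes never overlap, so Pos() can only match a valid separator at a codepoint boundary, and Copy() then never cuts a codepoint in half.

```pascal
program SplitDemo;
{$mode objfpc}{$H+}

// Hypothetical helper (not from LazUtils): splits S around the first
// occurrence of Sep using only the code-unit based Pos() and Copy().
procedure SplitAt(const S, Sep: string; out Head, Tail: string);
var
  P: Integer;
begin
  P := Pos(Sep, S);
  if P = 0 then
  begin
    // Separator not found: everything goes into Head.
    Head := S;
    Tail := '';
  end
  else
  begin
    Head := Copy(S, 1, P - 1);
    Tail := Copy(S, P + Length(Sep), Length(S));
  end;
end;

var
  S, Sep, Head, Tail: string;
begin
  // 'café' + U+2192 (rightwards arrow) + 'thé', written as raw UTF-8 bytes.
  Sep := #$E2#$86#$92;
  S := 'caf'#$C3#$A9 + Sep + 'th'#$C3#$A9;
  SplitAt(S, Sep, Head, Tail);
  WriteLn(Length(Head)); // 5 bytes: c, a, f and the two bytes of é
  WriteLn(Length(Tail)); // 4 bytes: t, h and the two bytes of é
end.
```

Both halves are still valid UTF-8 even though the code never looks at codepoints.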

StringofCodepoint is a trivial function. It takes a "String" parameter and could be renamed StringofString, which sounds stupid. Yes, it works with any string, not only a single codepoint.

Quote
You missed the point. In Delphi if I remember correctly, I could write For i := 1 to length(s) do S := 'Ç'; The compiler inherently understood Unicode and was able to do the right thing.
Uhhh, nonsense! The Delphi compiler does not understand Unicode any more than FPC does. It only understands the "Char" type. All Unicode support is built into library code.
For example, it has functions for UTF-16 surrogate pairs, but people typically don't use them.  :(
Are you getting the idea? Using LazUnicode improves code quality because encoding-agnostic code must be done right from the beginning.
BTW, your code makes no sense regardless of encoding:
Code: Pascal
For i := 1 to length(s) do S := 'Ç';
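One guess at what the quoted loop was after is a string with every character of S replaced by 'Ç'. A minimal sketch of that, assuming S holds valid UTF-8 in a plain AnsiString: CodePointCount and RepeatPiece are hypothetical helpers written for this illustration, not LazUnicode functions.

```pascal
program FillDemo;
{$mode objfpc}{$H+}

// Counts codepoints in valid UTF-8 by skipping continuation bytes (10xxxxxx).
function CodePointCount(const S: string): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(S) do
    if (Ord(S[i]) and $C0) <> $80 then
      Inc(Result);
end;

// Hypothetical helper: returns Count copies of Piece, whatever Piece contains.
function RepeatPiece(const Piece: string; Count: Integer): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Count do
    Result := Result + Piece;
end;

var
  S, Filled: string;
begin
  S := 'a'#$C3#$A9'b';                                // 'aéb' as raw UTF-8 bytes
  Filled := RepeatPiece(#$C3#$87, CodePointCount(S)); // 'Ç' once per codepoint
  WriteLn(Length(S));          // 4 code units
  WriteLn(CodePointCount(S));  // 3 codepoints
  WriteLn(Length(Filled));     // 6 code units, because 'Ç' takes two bytes
end.
```

The loop bound comes from the codepoint count, not from Length(), which is the distinction the compiler cannot make on its own.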

Quote
But if Unicode space is divided into 17 planes, and since a given Unicode character can belong to more than one plane where, conceivably, its relative ordinal index in that plane is distinct from its absolute Unicode ordinal index, how does my CompareStr continue to work? Is it because the RTL has already set the plane to, say, UTF-8 behind the scenes, and does this then change the ordinal value of the characters? That's where I'm not clear.
Indeed! You are totally confused!
Maybe you are mixing up the Unicode planes with the old Windows codepages? No, fortunately the codepages are gone! Unicode is the same everywhere. It is a paradise compared to Windows codepages, although it has locale-based rules (collation etc.).
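To make the plane question concrete: every codepoint has a single number in the range 0..$10FFFF, and its plane is simply that number divided by $10000, so a character sits in exactly one plane and its ordinal value does not depend on the encoding. The sketch below assumes valid UTF-8 input; DecodeAt is a hypothetical helper written for this illustration.

```pascal
program PlaneDemo;
{$mode objfpc}{$H+}

// Decodes the codepoint that starts at S[Index]; assumes valid UTF-8.
function DecodeAt(const S: string; Index: Integer): Integer;
var
  B, Extra, i: Integer;
begin
  B := Ord(S[Index]);
  // The lead byte says how many continuation bytes follow.
  if B < $80 then
  begin Result := B; Extra := 0; end
  else if (B and $E0) = $C0 then
  begin Result := B and $1F; Extra := 1; end
  else if (B and $F0) = $E0 then
  begin Result := B and $0F; Extra := 2; end
  else
  begin Result := B and $07; Extra := 3; end;
  // Each continuation byte contributes its low six bits.
  for i := 1 to Extra do
    Result := (Result shl 6) or (Ord(S[Index + i]) and $3F);
end;

var
  CP: Integer;
begin
  CP := DecodeAt('A', 1);
  WriteLn(CP, ' ', CP shr 16);          // 65 0     (U+0041, plane 0)
  CP := DecodeAt(#$F0#$9F#$98#$80, 1);
  WriteLn(CP, ' ', CP shr 16);          // 128512 1 (U+1F600, plane 1)
end.
```

UTF-8 and UTF-16 are only different byte layouts for the same numbers.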

Quote
In fact, I looked at your implementation of CompareStr: it relies on CompareStrW, which is clearly a call to the underlying Windows OS. I'm not clear why this is so.
You mean AnsiCompareStr, which maps to UTF8CompareStr? It is originally from Mattias, who is much more clever with Unicode than I am.
Anyway, the behavior is compatible with Delphi. The Ansi...() functions support Unicode plus its locale-specific rules.
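For contrast with the locale-aware Ansi...() functions, here is a small sketch of plain ordinal comparison, using only CompareStr from SysUtils and the built-in < operator. It assumes both strings are UTF-8 held in ordinary AnsiStrings. For valid UTF-8, byte-wise order is the same as codepoint order, which is why ordinal comparison gives a stable (if not linguistically correct) result. Locale-aware results vary by system, so none are shown.

```pascal
program CompareDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

var
  Z, EAcute: string;
begin
  // Ordinal rule: 'B' is $42 and 'a' is $61, so 'B' sorts first.
  WriteLn(CompareStr('B', 'a') < 0);   // TRUE
  WriteLn('B' < 'a');                  // TRUE, same ordinal rule
  Z := 'z';
  EAcute := #$C3#$A9;                  // 'é' (U+00E9) as raw UTF-8 bytes
  // $7A < $C3 byte-wise, which agrees with U+007A < U+00E9.
  WriteLn(CompareStr(Z, EAcute) < 0);  // TRUE
end.
```

A dictionary would put 'é' before 'z', which is the kind of rule that needs the locale-specific functions.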

Quote
Since StrUtils is part of the RTL, which is UTF-16, is there a requirement to convert from UTF-8 to UTF-16 before calling these routines?
RTL is not UTF-16. Where did you get that idea from?
Type "String" maps to "AnsiString" by default. Then LazUtils switches its encoding to UTF-8.
BTW, you don't need to convert explicitly between string types or encodings because FPC does it automatically.

Quote
I'm sure all of these issues are obvious to you, but for me, as I begin to piece this puzzle back together, a lot of these questions come up.
Just make sure you get the basics right. So far you had some wrong assumptions.
Then keep learning ...

Quote
To your point, what's tricky here is that almost every character I'm familiar with (standard Latin, French, Arabic, and Syriac) is represented in the UCS-2 encoding, and for these it would work out of the box, so there's this tendency to believe it would work consistently. But then, if the RTL is UTF-16 based ...
It is not.
« Last Edit: August 19, 2017, 12:10:14 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
