
Author Topic: A simple sane question to end insanity! TStringList in unicode mode  (Read 24004 times)

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #30 on: August 16, 2017, 03:52:15 pm »
Quote
Yes, but you and me understand the bug. He doesn't.

This was about printing É in his console where the acute went missing.
I also don't understand it. What caused it?
The sorting is a different issue.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #31 on: August 16, 2017, 04:04:02 pm »
Quote
1. {$modeswitch unicodeStrings}: If I don't include this switch, I get a raft of warning messages

Last time you explained the opposite. You got warning messages when you included {$modeswitch unicodeStrings}.
Which way is it?

Quote
3. My case-sensitive method relies on this bit of code inside a loop:
Code: Pascal
    V1 := Ord(S1[i]);
    V2 := Ord(S2[i]);
    if V1 = V2 then
      continue
    else begin
      Result := V1 - V2;
      exit;
    end;

Why don't you just call AnsiCompareStr(S1, S2); ?

Quote
4. For case-insensitive comparison, the code is slightly modified as follows:
Code: Pascal
    V1 := Ord(TCharacter.ToLower(S1[i]));
    V2 := Ord(TCharacter.ToLower(S2[i]));
    if V1 = V2 then
      continue
    else begin
      Result := V1 - V2;
      exit;
    end;

A typical error when you don't know enough about Unicode. :(
You treat UTF-16 as a fixed width encoding while it actually is variable width.
The code is wrong for about half of the defined codepoints.
You may want to look at the unit LazUnicode in the package LazUtils. It lets you write robust, encoding-agnostic code that also supports Delphi.
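
To make the variable-width point concrete, here is a minimal sketch (not code from this thread; the program name, the helper UnitKind and the sample character U+1F600 are illustrative assumptions). It builds one codepoint outside the Basic Multilingual Plane from its two UTF-16 code units and shows that Length and S[i] operate on code units, not characters. A loop that applies Ord or ToLower per index therefore sees each half of a surrogate pair separately.

```pascal
program SurrogateDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Hypothetical helper: classifies one UTF-16 code unit.
function UnitKind(C: WideChar): string;
begin
  if (Ord(C) >= $D800) and (Ord(C) <= $DBFF) then
    Result := 'high surrogate'
  else if (Ord(C) >= $DC00) and (Ord(C) <= $DFFF) then
    Result := 'low surrogate'
  else
    Result := 'BMP code unit';
end;

var
  S: UnicodeString;
  i: Integer;
begin
  // U+1F600 is a single codepoint, but UTF-16 stores it as two code units.
  SetLength(S, 2);
  S[1] := WideChar($D83D);
  S[2] := WideChar($DE00);

  // Length counts code units, so this reports 2 for one codepoint.
  WriteLn('Code units: ', Length(S));
  for i := 1 to Length(S) do
    WriteLn(i, ': $', IntToHex(Ord(S[i]), 4), ' ', UnitKind(S[i]));
end.
```

Neither half is a character on its own, so comparing or lowercasing S[i] in isolation cannot give a meaningful result for such text.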

Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #32 on: August 16, 2017, 05:12:39 pm »
Quote
Why don't you just call AnsiCompareStr(S1, S2); ?

Because it currently looks broken? (Actually not there, but in the string manager, big time. It is a huge bug, because it breaks everything else too, like TStringList.Sort.)
But of course, normally: yes.

I am still trying to figure out why (as per my Mantis entries).
Specialize a type, not a var.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #33 on: August 16, 2017, 09:49:02 pm »
Quote
Because it currently looks broken? (Actually not there, but in the string manager, big time. It is a huge bug, because it breaks everything else too, like TStringList.Sort.)

Ok, I take your word for it.
I was however wondering how such a bug can go unnoticed until now. That's why I thought it was some Unicode specific sorting rule issue.

Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #34 on: August 16, 2017, 10:13:29 pm »
That's what I thought. Therefore I think it is a regression. But I haven't found it yet. It is reproducible on all platforms.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #35 on: August 17, 2017, 06:18:39 pm »
Quote
That's what I thought. Therefore I think it is a regression.

I don't think so.

Quote
But I haven't found it yet.

Because it is not a regression.

Quote
It is reproducible on all platforms.

Because all of your failing tests depend on the locale. Windows, for instance, uses CompareString. Linux, on the other hand, uses strcoll once you add the cwstring unit. Both CompareString and strcoll depend on the locale.
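
A quick way to see the difference on your own machine is to put a binary comparison next to a locale-aware one. This is only a sketch: the program name and the Describe helper are made up for illustration, and the second line of output is not something I can predict, because it depends on your locale and on the installed string manager (cwstring on Unix).

```pascal
program CollateDemo;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // installs the locale-aware string manager on Unix
  SysUtils;

// Hypothetical helper: turns a comparison result into readable text.
function Describe(R: Integer): string;
begin
  if R < 0 then
    Result := 'first < second'
  else if R > 0 then
    Result := 'first > second'
  else
    Result := 'equal';
end;

begin
  // CompareStr compares raw byte values: 'a' is 97 and 'B' is 66, so 'a' > 'B'.
  WriteLn('CompareStr:     ', Describe(CompareStr('a', 'B')));
  // AnsiCompareStr goes through the string manager, so the result follows the locale.
  WriteLn('AnsiCompareStr: ', Describe(AnsiCompareStr('a', 'B')));
end.
```

If the two lines disagree, nothing is broken: the second call is simply applying your locale's collation rules.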

Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #36 on: August 17, 2017, 06:47:45 pm »
That should never fail with e.g. the en_US.UTF-8 locale. The collation should be equal to the ANSI collation. Still it fails. That's the point. For 0..127, en_US.UTF-8 equals ANSI.
« Last Edit: August 17, 2017, 06:50:35 pm by Thaddy »

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #37 on: August 17, 2017, 08:49:43 pm »
To  JuhaManninen:

You wrote:

Quote
A typical error when you don't know enough about Unicode. :(
You treat UTF-16 as a fixed width encoding while it actually is variable width.


Thanks for the correction. Yes, I am new to the Unicode world, so that is helpful.
However, I searched through the LazUtils package and then in the Lazarus folder and could not find the unit LazUnicode.
Here's the relevant list of files of LazUtils when I display it in sorted order:

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #38 on: August 17, 2017, 09:35:22 pm »
Quote
That should never fail with e.g. the en_US.UTF-8 locale. The collation should be equal to the ANSI collation. Still it fails. That's the point. For 0..127, en_US.UTF-8 equals ANSI.

It did not fail. It did what it is supposed to do, but not what you expect. For instance, small letters sort before capital letters, contrary to ASCII.
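
If plain byte order is what you actually want, regardless of locale, one option is to bypass the locale-aware comparison with TStringList.CustomSort. A minimal sketch, where OrdinalCompare is a hypothetical helper name and the four sample strings are only there to show the ordering:

```pascal
program OrdinalSortDemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Hypothetical helper: compares two list items by raw byte values, ignoring the locale.
function OrdinalCompare(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

var
  SL: TStringList;
  i: Integer;
begin
  SL := TStringList.Create;
  try
    SL.Add('b');
    SL.Add('A');
    SL.Add('a');
    SL.Add('B');
    // CustomSort uses our comparison instead of the list's default one.
    SL.CustomSort(@OrdinalCompare);
    for i := 0 to SL.Count - 1 do
      WriteLn(SL[i]);  // prints A, B, a, b: capitals first, as in ASCII
  finally
    SL.Free;
  end;
end.
```

This gives the ASCII-style order some people in this thread expected, at the cost of ignoring linguistic rules for accented letters.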

Thaddy

  • Hero Member
  • *****
  • Posts: 14204
  • Probably until I exterminate Putin.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #39 on: August 17, 2017, 10:25:21 pm »
Quote
That should never fail with e.g. the en_US.UTF-8 locale. The collation should be equal to the ANSI collation. Still it fails. That's the point. For 0..127, en_US.UTF-8 equals ANSI.

Quote
It did not fail. It did what it is supposed to do, but not what you expect. For instance, small letters sort before capital letters, contrary to ASCII.

It is not ASCII, but full ANSI. The first 256 characters in Unicode (and UCS-2) are even specified as Windows CP1252 and *nix ISO-8859-1 (these are similar but not quite the same).
That should *never* affect collation order. And that is the official specification.

Where did you get that from?

Collation order defaults to natural sort order for at least these first 128 characters, and also defaults to the natural order of CP1252/ISO-8859-1 for the next 128. Nothing is reversed.
« Last Edit: August 17, 2017, 10:31:00 pm by Thaddy »

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4459
  • I like bugs.
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #40 on: August 18, 2017, 12:43:27 am »
Quote
However, I searched through the LazUtils package and then in the Lazarus folder and could not find the unit LazUnicode.

It is only in Lazarus 1.8. I guess you have 1.6.
You can download the needed files also from SVN, either trunk or fixes_1_8 branch, if you don't want to update Lazarus.
However I recommend Lazarus 1.8 RC4.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #41 on: August 18, 2017, 02:11:12 am »
Quote
That should never fail with e.g. the en_US.UTF-8 locale. The collation should be equal to the ANSI collation. Still it fails. That's the point. For 0..127, en_US.UTF-8 equals ANSI.

Quote
It did not fail. It did what it is supposed to do, but not what you expect. For instance, small letters sort before capital letters, contrary to ASCII.

Quote
It is not ASCII, but full ANSI. The first 256 characters in Unicode (and UCS-2) are even specified as Windows CP1252 and *nix ISO-8859-1 (these are similar but not quite the same).
That should *never* affect collation order. And that is the official specification.

It is ASCII, at least for the POSIX locale. Search for ASCII here.

Quote
Where did you get that from?

I don't remember. I'll see if I can find it. But it is easy to check your locale: either do the comparison using your system API, like CompareString on Windows, or, better, get the weights used in the final binary comparison.

You can get these weights on Windows using LCMapString. Notice that the figures are implementation dependent. Windows uses the following format:
Quote
[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00

Quote
Collation order defaults to natural sort order for at least these first 128 characters, and also defaults to the natural order of CP1252/ISO-8859-1 for the next 128. Nothing is reversed.

You are talking about the POSIX locale. Any locale could change that order. According to the results on your system, and on EganSolo's, your locales reversed the order, hence this understandable confusion for both of you.

Again, Windows CompareString and cwstring's strcoll are both locale dependent. The locales on your systems are reversing the order of capital and small letters. Test my statement using CompareString/strcoll.

EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #42 on: August 18, 2017, 03:13:33 am »
JuhaManninen:
Quote
It is only in Lazarus 1.8. I guess you have 1.6.
You can download the needed files also from SVN, either trunk or fixes_1_8 branch, if you don't want to update Lazarus.
However I recommend Lazarus 1.8 RC4.

On it. Thanks for the tip.

By the way, as I wade my way through this Unicode stuff, I'll try to write a beginner's summary of what I'm learning here. The links you provided are making more and more sense, but my impression is that they are not written for beginners :).

I'll upgrade to 1.8, go through the LazUnicode unit and I'll post back here some of my own observations.


EganSolo

  • Sr. Member
  • ****
  • Posts: 290
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #43 on: August 18, 2017, 04:38:51 am »
Engkin, you wrote:
Quote
According to the results on your system, and on EganSolo's, your locales reversed the order, hence this understandable confusion for both of you. Again, Windows CompareString and cwstring's strcoll are both locale dependent. The locales on your systems are reversing the order of capital and small letters. Test my statement using CompareString/strcoll.

I wondered what the equivalent code would do under C#, so I wrote this bit of code https://goo.gl/zkWXpJ
on CodingGround http://www.tutorialspoint.com/compile_csharp_online.php and the String.Compare function in C# returns "a < B" as you'll see if you run that code.

MSDN https://social.msdn.microsoft.com/Search/en-US?query=string&pgArea=header&emptyWatermark=true&ac=2 states that for the .Net Framework version 4.0:
Quote
A string is a sequential collection of Unicode characters that is used to represent text. The value of the String object is the content of the sequential collection of characters

Therefore, it would seem that under C# a Unicode comparison of 'a' and 'B' returns 'a' < 'B'. I'm now left with more questions than answers, but I'll update to 1.8, study LazUnicode and come back for more later.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: A simple sane question to end insanity! TStringList in unicode mode
« Reply #44 on: August 18, 2017, 05:14:39 am »
Quote
Therefore, it would seem that under C# a Unicode comparison of 'a' and 'B' returns 'a' < 'B'.

Thank you!

 
