Author Topic: Sorting in a TStringList with UTF8 behaves differently in Linux and Windows  (Read 968 times)

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Hey Y'All,

While testing the generator app for the 1 Billion Rows Challenge for Object Pascal I've realised that the sorting of the TStringList does not match between Linux and Windows.

Is this a known issue or something no one has ever stumbled upon?

Cheers,
Gus
« Last Edit: March 01, 2024, 01:14:28 am by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

TRon

  • Hero Member
  • *****
  • Posts: 2504
Is this a known issue or something no one has ever stumbled upon?
What do you mean by "there is a difference"?

It sorts differently because it takes the locale into account by default?

It can be 'solved' using your own custom sort routine that matches for both platforms.
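
A minimal sketch of such a custom sort routine, assuming plain byte-wise ordering is acceptable for the station names. The program name, the helper name CompareBytewise and the sample strings are made up for illustration and are not from the original code. As far as I know, CustomSort is meant for lists where Sorted is False, so the sketch leaves Sorted at its default.

```pascal
program CustomSortSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

{ Hypothetical helper: compares two entries by their byte values,
  so the result does not depend on the locale or the OS. }
function CompareBytewise(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := CompareStr(List[Index1], List[Index2]);
end;

var
  Names: TStringList;
  I: Integer;
begin
  Names := TStringList.Create;
  try
    { Sorted stays False here; the list is sorted explicitly below. }
    Names.Add('Zürich');
    Names.Add('Ürümqi');
    Names.Add('Abha');

    { Sort once with the locale-independent comparison. }
    Names.CustomSort(@CompareBytewise);

    for I := 0 to Names.Count - 1 do
      WriteLn(Names[I]);
  finally
    Names.Free;
  end;
end.
```

With a UTF8-encoded source file, every non-ASCII character starts with a byte above the ASCII range, so 'Ürümqi' ends up after 'Zürich'. That order is not what a human would expect from a dictionary, but it should be the same on Linux and Windows.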


ASerge

  • Hero Member
  • *****
  • Posts: 2241
If these are not ASCII characters, then the sorting will depend on the locale, and it may also be different in Windows and Linux.

Gustavo 'Gus' Carreno

Hey ASerge,

If these are not ASCII characters, then the sorting will depend on the locale, and it may also be different in Windows and Linux.

Ok, this makes sense.

If you have a look at the weather_stations.csv, it contains accented characters, in UTF8.
This will probably throw off the comparison function...

Apart from writing my own comparison function, is there a way to sort this out that only involves setting some property, or the like?

I SO HATE UTF8 !!!! LOL!!

Cheers,
Gus

ASerge

Apart from doing my own comparison function...
I didn't see how you were comparing.

Gustavo 'Gus' Carreno

Hey ASerge,

Apart from doing my own comparison function...
I didn't see how you were comparing.

Welp, that's the thing, I'm not doing my own comparison AT ALL!!

Code: Pascal
{ Generator Constructor }
FStationNames:= TStringList.Create;
FStationNames.Duplicates:= dupIgnore;
FStationNames.Sorted:= True;

{ Insert a bunch of UTF8 and NON UTF8 Strings }
{ This code is inside a loop that reads from a TStreamReader, line by line }
{ var entry: String }
entry:= entry.Split(';')[0];
FStationNames.Add(entry);
{ End of loop }

{ Generator Destructor }
FStationNames.Free;

So, as you can see, I'm not doing anything, I'm just using whatever TStringList comes with to do the sorting.

And because of that, I think this should not have different behaviours depending on the OS.
This is core RTL, or FCL ( don't really know exactly ) and should not have different behaviours just because the OS is different.

At least this is my opinion, conveyed with humility and apologies if it angers anyone.

Cheers,
Gus

Gustavo 'Gus' Carreno

Hey Y'All,

I was recently directed to a screenshot I made about the Lazarus 3.2 release, and had completely forgotten about its contents.

In it, it mentions that the TStringListUTF8 will be merged into the regular TStringList.

Before I created this thread I completely forgot about that!!
And I think that if I used TStringListUTF8 to begin with, I would not see this different behaviour.

Since Lazarus 3.2 is not that mainstream, YET, I'll alter my code to use TStringListUTF8 in the meantime and see if this solves anything.

I'll revert it to TStringList in the future.

Many, MANY, thanks to ASerge and TRon for your input!!

Cheers,
Gus

alpine

  • Hero Member
  • *****
  • Posts: 1062
While testing the generator app for the 1 Billion Rows Challenge for Object Pascal I've realised that the sorting of the TStringList does not match between Linux and Windows.
BTW, What is a billion - giga or tera?   ::)
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

cdbc

  • Hero Member
  • *****
  • Posts: 1077
    • http://www.cdbc.dk
Hi
Giga
Regards Benny
If it ain't broke, don't fix it ;)
PCLinuxOS(rolling release) 64bit -> KDE5 -> FPC 3.2.2 -> Lazarus 2.2.6 up until Jan 2024 from then on it's: KDE5/QT5 -> FPC 3.3.1 -> Lazarus 3.0

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4467
  • I like bugs.
In it, it mentions that the TStringListUTF8 will be merged into the regular TStringList.
Before I created this thread I completely forgot about that!!
And I think that if I used TStringListUTF8 to begin with, I would not see this different behaviour.
TStringListUTF8 does not exist in LazUtils or other Lazarus sources any more. It was needed a long time ago when FPC did not have dynamic codepages.
Now you get full Unicode support with TStringList.

LazUtils has TStringListUTF8Fast which is an optimization for use cases that most often have pure ASCII but can have non-ASCII sometimes.
Lazarus IDE has many such cases. Pascal identifiers, directory names for FPC and Lazarus sources etc.

TStringList has 2 properties that affect comparison and sorting, namely CaseSensitive and UseLocale.
By default CaseSensitive=False and UseLocale=True, so the super-slow AnsiCompareText() is used for comparison.
This makes TStringList a potential trap: code can become very slow without the programmer realizing why.
Setting CaseSensitive:=True already makes it faster, although you may still get a different order in different locales.
The different order is not related to UTF8 encoding. It is related to the complex Unicode rules. WideStringManager on Windows uses the system's UTF16 encoding.
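
A minimal sketch of how those two properties might be set on a sorted list like the one in the generator code. It assumes an FPC version whose TStringList exposes UseLocale, and that UseLocale=False together with CaseSensitive=True selects a plain ordinal (byte-wise) comparison. The program name, variable name and sample strings are made up for illustration.

```pascal
program ByteOrderSortedList;
{$mode objfpc}{$H+}

uses
  Classes;

var
  Stations: TStringList;
  I: Integer;
begin
  Stations := TStringList.Create;
  try
    { Set the comparison options before any string is added, so every
      insertion into the sorted list uses the same rule. }
    Stations.CaseSensitive := True;   { no case folding }
    Stations.UseLocale := False;      { assumption: selects ordinal comparison }
    Stations.Duplicates := dupIgnore;
    Stations.Sorted := True;

    Stations.Add('Zürich');
    Stations.Add('Ürümqi');
    Stations.Add('Abha');
    Stations.Add('Abha');             { second copy is dropped by dupIgnore }

    for I := 0 to Stations.Count - 1 do
      WriteLn(Stations[I]);
  finally
    Stations.Free;
  end;
end.
```

If the assumption holds, the resulting order no longer depends on the locale, at the price of accented names being grouped after all plain ASCII ones.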


Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

 
