
Author Topic: BUG? Can someone explain  (Read 19777 times)

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #30 on: March 11, 2016, 04:50:51 pm »
There is 7-bit ASCII, which should always be the same; there is 8-bit ASCII, which uses code pages that all differ and each have their own sort order; and there is Unicode.

Unicode defines 32-bit characters. But because many people see that as wasting memory, they came up with a clever way to implement it, that allows packing it in a variable-length format, like UTF8 or UTF16. FPC/Lazarus (and HTTP) use UTF8, Windows uses a "custom", fixed 16-bit UTF16.

I don't want to convert my 32-bit Unicode into 16-bit Microsoft Unicode and lose information, compare it and then convert it back to 32-bit Unicode.

BeniBela

  • Hero Member
  • *****
  • Posts: 947
    • homepage
Re: BUG? Can someone explain
« Reply #31 on: March 11, 2016, 04:55:03 pm »
The only reliable way is to write your own functions for everything.

The standard functions change their behavior whenever you expect it the least.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #32 on: March 11, 2016, 05:11:03 pm »
It would be best if there was some library that included the exact lexical Unicode sort order for each locale, but as long as it doesn't exist, I prefer consistency as well.


For an explanation, go here.

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #33 on: March 11, 2016, 05:16:27 pm »
... In which, the underscore is in between uppercase and lowercase letters. Which requires giving a different result than with Ascii, where the underscore comes before both uppercase and lowercase letters.

No, the underscore in UTF8 still is between 'A' and 'a'.
The problem is that CompareText essentially does a comparison between UpCase(Chr1) and UpCase(Chr2) and returns Ord(UpCase(Chr1))-Ord(UpCase(Chr2)) as the function result when these two differ.
This is a remnant from the old TP days.
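
The behaviour described above can be sketched in a few lines. This is a Python illustration only, not the FPC RTL source; `compare_str` and `compare_text` are hypothetical names, and the upcasing is assumed to be ASCII-only.

```python
def compare_str(a: str, b: str) -> int:
    """Ordinal comparison: difference of the first pair of bytes that differ."""
    ba, bb = a.encode("utf-8"), b.encode("utf-8")
    for x, y in zip(ba, bb):
        if x != y:
            return x - y
    # One string is a prefix of the other: fall back to the length difference.
    return len(ba) - len(bb)


def ascii_upcase(s: str) -> str:
    """Upper-case the letters a..z only (assumption: no locale awareness)."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def compare_text(a: str, b: str) -> int:
    """Case-insensitive variant: upcase both sides, then compare ordinals."""
    return compare_str(ascii_upcase(a), ascii_upcase(b))


print(compare_str("APPLE", "_APPLE"))   # 'A' (65) against '_' (95)
print(compare_str("apple", "_apple"))   # 'a' (97) against '_' (95)
print(compare_text("apple", "_apple"))  # upcased 'A' (65) against '_' (95)
```

Because '_' (95) sits between 'A' (65) and 'a' (97), upcasing flips the sign of the result for lowercase input, which is the effect discussed in this thread.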

Taking into account the locale should probably mean that '_' is considered to be lower than 'a' and 'A' both.
Also generally 'a' is considered to be lower than 'A'.
CompareMemRange does not do that obviously.

Code:
CompareStr(A,a)     = -32
Utf8CompareStr(A,a) = -32
AnsiCompareStr(A,a) = 1 //plain fpc
WideCompareStr(A,a) = 1

So Geepster may have a valid point w.r.t. this behaviour of Utf8CompareStr() (even though he should not have relied on it  >:D).

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #34 on: March 11, 2016, 05:17:46 pm »
It would be best if there was some library that included the exact lexical Unicode sort order for each locale, but as long as it doesn't exist, I prefer consistency as well.

I think the fpwidestring aims to do that.

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #35 on: March 11, 2016, 05:23:19 pm »
Unicode defines 32-bit characters.

Wherever did you get that idea from?
Unicode defines more than one form of encoding Unicode code points.
The best known (?) are UTF8, UTF16LE/UTF16BE (which if I understand correctly cannot hold every codepoint) and indeed also UTF32.

Utf8 is (for Western European languages) probably the most efficient one.
And it also has the quality that a zero character will mean end of string (unless you're in pascal), which is a bonus for transmission protocols.
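
A quick way to check these properties is to encode a few code points and look at the bytes. The sketch below is a Python illustration (`encoded_sizes` is a hypothetical helper). It also suggests that UTF-16 can hold code points above the 16-bit range, by storing them as a surrogate pair.

```python
def encoded_sizes(text: str) -> dict:
    """Number of bytes the text occupies in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


# U+0041, U+00E9, U+20AC and U+1F600 (the last one lies outside the 16-bit range).
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-16 stores U+1F600 as the surrogate pair D83D DE00, so nothing is lost.
pair = "\U0001F600".encode("utf-16-be")
assert pair.hex() == "d83dde00"
assert pair.decode("utf-16-be") == "\U0001F600"

# UTF-8 never emits a zero byte for non-NUL text, while UTF-16 does for ASCII.
assert b"\x00" not in "A\u00e9\u20ac\U0001F600".encode("utf-8")
assert b"\x00" in "A".encode("utf-16-le")
```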

Bart

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #36 on: March 11, 2016, 05:36:09 pm »
Ok, technically Unicode specifies 22 bits worth of planes, and defines some formats with some default mappings. But unless you use, for example, a fixed 16-bit format, you can represent all of those characters.

It might take 6 bytes in UTF8 for some, but you can use them. And if you want a format where you can use them all and they all have the same length, you have to use UTF32.
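
For reference, the Unicode code space currently ends at U+10FFFF, which fits in 21 bits, and UTF-8 as restricted by RFC 3629 needs at most 4 bytes per code point; the 5- and 6-byte forms belong to the older, wider definition of UTF-8. A small Python sketch of the length rule (`utf8_length` is a hypothetical helper):

```python
MAX_CODE_POINT = 0x10FFFF  # highest position in the Unicode code space


def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8 as limited by RFC 3629."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


print(MAX_CODE_POINT.bit_length())  # bits needed for the largest code point
print(utf8_length(MAX_CODE_POINT))  # bytes needed for the largest code point
```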

We have become used to lexical sorting with the code pages and the Windows implementations of them. But without a comprehensive library like fpwidestring that even understands all European locales, we have to improvise.

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #37 on: March 11, 2016, 07:45:52 pm »
Some more comparisons Windows vs Linux, plain fpc vs Lazarus:

Code:
------ Plain Fpc ------
 Windows                                     Linux
 AnsiCompareStr(APPLE,_APPLE)  = 1           -1
 CompareStr(APPLE,_APPLE)      = -30         30
 WideCompareStr(APPLE,_APPLE)  = 1           -1

 AnsiCompareStr(apple,_apple)  = 1           -1
 CompareStr(apple,_apple)      = 2           2
 WideCompareStr(apple,_apple)  = 1           -1

 AnsiCompareText(APPLE,_APPLE) = 1           -1
 CompareText(APPLE,_APPLE)     = -30         30
 WideCompareText(APPLE,_APPLE) = 1           -1

 AnsiCompareText(apple,_apple) = 1           -1
 CompareText(apple,_apple)     = -30         30
 WideCompareText(apple,_apple) = 1           -1

 AnsiCompareStr(A,a) = 1                     7
 CompareStr(A,a)     = -32                   -32
 WideCompareStr(A,a) = 1                     7



------ Lazarus -------
 Windows                                     Linux
 Utf8CompareStr(APPLE,_APPLE)  = -30         -30
 AnsiCompareStr(APPLE,_APPLE)  = -30         -1
 CompareStr(APPLE,_APPLE)      = -30         -30
 WideCompareStr(APPLE,_APPLE)  = 1           -1

 Utf8CompareStr(apple,_apple)  = 2           2
 AnsiCompareStr(apple,_apple)  = 2           -1
 CompareStr(apple,_apple)      = 2           2
 WideCompareStr(apple,_apple)  = 1           -1

 Utf8CompareText(APPLE,_APPLE) = 2           2
 AnsiCompareText(APPLE,_APPLE) = 2           -1
 CompareText(APPLE,_APPLE)     = -30         -30
 WideCompareText(APPLE,_APPLE) = 1           -1

 Utf8CompareText(apple,_apple) = 2           2
 AnsiCompareText(apple,_apple) = 2           -1
 CompareText(apple,_apple)     = -30         -30
 WideCompareText(apple,_apple) = 1           -1

 Utf8CompareStr(A,a) = -32                   -32
 AnsiCompareStr(A,a) = -32                   7
 CompareStr(A,a)     = -32                   -32
 WideCompareStr(A,a) = 1                     7

Notice that under Linux, LazUtf8 does not "tamper with" the widestringmanager (it uses cwstring).

The example demonstrates that it is wrong (TM) to rely on xxxCompare* functions to return a specific sign.
You can only rely on the fact that it will return the same sign on the same OS/WS.

Furthermore AnsiCompare* and WideCompare* have the same sign (plain fpc) and IMO Utf8Compare* should have the same sign as well.
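
In practice this means calling code should look at the sign of the result at most, and never at its magnitude. A minimal Python sketch, using three of the Lazarus values for (APPLE,_APPLE) from the listing above; `sign` is a hypothetical helper:

```python
def sign(n: int) -> int:
    """Collapse a comparison result to -1, 0 or 1."""
    return (n > 0) - (n < 0)


# Values copied from the Lazarus listing above for (APPLE, _APPLE).
reported = {
    ("Utf8CompareStr", "Windows"): -30,
    ("AnsiCompareStr", "Linux"): -1,
    ("WideCompareStr", "Windows"): 1,
}

for (func, os_name), result in reported.items():
    print(f"{func} on {os_name}: raw={result} sign={sign(result)}")
```

The raw values -30 and -1 agree once reduced to a sign, while WideCompareStr on Windows reports the opposite order, so even the sign is only stable for one function on one platform.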

Bart

derek.john.evans

  • Guest
Re: BUG? Can someone explain
« Reply #38 on: March 11, 2016, 08:46:39 pm »
Anyway. I'm sorry for being a grumpy f-er. Sorry for any uncalled-for comments. I'm just a little WTF atm.

This stuff messes with my head. Character sorting order is a cornerstone of a programmer's life!

I'm ok with changing the order, but, when you have a sorting algorithm using one standard, then, a binary search using another.....

Well, that's just chaos.
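
The failure mode is easy to reproduce. In the Python sketch below, the list is sorted with a hypothetical "underscore before letters" order (standing in for a locale-aware compare) and then searched with a plain ordinal compare; all function names are illustrative assumptions.

```python
from functools import cmp_to_key


def compare_ordinal(a: str, b: str) -> int:
    """Byte-wise comparison reduced to -1, 0 or 1."""
    ea, eb = a.encode("utf-8"), b.encode("utf-8")
    return (ea > eb) - (ea < eb)


def compare_underscore_first(a: str, b: str) -> int:
    """Hypothetical locale-like order: '_' sorts before any letter."""
    ka, kb = a.replace("_", "\x00"), b.replace("_", "\x00")
    return (ka > kb) - (ka < kb)


def binary_search(items, target, compare) -> int:
    """Classic binary search; returns the index of target or -1."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare(items[mid], target)
        if c == 0:
            return mid
        if c < 0:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


words = sorted(["BANANA", "APPLE", "_APPLE"],
               key=cmp_to_key(compare_underscore_first))

# Same comparator for sorting and searching: the entry is found.
print(binary_search(words, "_APPLE", compare_underscore_first))
# Different comparator for searching: the entry is present but not found.
print(binary_search(words, "_APPLE", compare_ordinal))
```

Whichever order is chosen, the sort and the search have to use the same compare function.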

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: BUG? Can someone explain
« Reply #39 on: March 11, 2016, 10:18:49 pm »
It might take 6 bytes in UTF8 for some, but you can use them. And if you want a format where you can use them all and they all have the same length, you have to use UTF32.

Encodings deal with codepoints. Unfortunately UTF-32 does not help with the real complications of Unicode, namely the decomposed accented characters. They are beyond codepoints.
UTF-32 is fixed width, true, but in practice it slows your code down due to bigger memory usage + cache misses etc.
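
A short Python illustration of the decomposed-character problem, using the standard `unicodedata` module: the same visible character can be one code point or two, and the two forms only compare equal after normalization.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Both render as the same character, but they are different code point sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# UTF-32 is fixed width per code point, not per visible character.
print(len(composed.encode("utf-32-le")), len(decomposed.encode("utf-32-le")))

# Normalizing both sides to the same form makes the comparison meaningful.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)
```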

Variable width encoding is not as bad as it first sounds because typical code only seldom studies individual codepoints beyond ASCII area.
It also means that the UTF-8 system now in Lazarus is amazingly Delphi compatible in practice, except for this unfortunate sort order difference discovered now.
I already knew that the accented characters are difficult to sort because they can be represented in different ways (by 1 or 2 codepoints for the same accented char).
I did not know that even basic ASCII can be difficult to sort.
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #40 on: March 11, 2016, 11:22:31 pm »
Back to topic.

Currently 2 possible solutions have been proposed:
  • Use Utf8UpperCase in Utf8CompareText
  • Use WideCompare* functions

Solution 1 will not fix the problem that 'a' is not treated as lower than 'A', etc.
Also, it is unclear to me whether for every 2- or 3-byte Utf8 codepoint the UpperCase will be > LowerCase when just comparing bytes (this may especially be a problem if the Upper- and LowerCase forms have a different number of bytes).
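
The doubt about byte order and byte counts can be checked with a few sample characters. The sketch below uses Python's built-in `str.upper()` as a stand-in for Utf8UpperCase, which is an assumption (the two may differ in detail); `utf8_case_info` is a hypothetical helper.

```python
def utf8_case_info(lower: str):
    """Return (bytes of lowercase, bytes of uppercase, sign of upper vs lower)."""
    upper = lower.upper()
    lo_b, up_b = lower.encode("utf-8"), upper.encode("utf-8")
    return len(lo_b), len(up_b), (up_b > lo_b) - (up_b < lo_b)


samples = {
    "a": "a",                             # U+0061
    "e with acute": "\u00e9",             # U+00E9
    "y with diaeresis": "\u00ff",         # U+00FF
    "dotless i": "\u0131",                # U+0131
    "a with stroke": "\u2c65",            # U+2C65
}

for name, ch in samples.items():
    print(name, utf8_case_info(ch))
```

With these samples, the uppercase form sorts before the lowercase form byte-wise for U+00E9 but after it for U+00FF, and U+0131 and U+2C65 change their byte count when upper-cased.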

Solution 2 may have problems on non-Windows platforms, and there is a penalty for the extra overhead (converting to WideString/UnicodeString).
It will make Utf8Compare* have the same sign as original AnsiCompare* and WideCompare* functions.
This would still leave the "problem" that the (sign of the) result is not the same on Windows and Linux.
Using the fpwidestring unit may solve that, but it seems that this unit must be the first in the program's uses clause in order to work correctly?

Solution 3 would be to completely rewrite Utf8CompareStr() to be locale-aware.
That would however mean double work (in fact you'd be "forking" the fpwidestring unit).

There's also one other thing to consider.
Setting WidestringManager routines to Utf8* functions in LazUtf8 should maybe be disabled when DisableUtf8RTL is defined?

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4631
  • I like bugs.
Re: BUG? Can someone explain
« Reply #41 on: March 12, 2016, 12:07:37 am »
There's also one other thing to consider.
Setting WidestringManager routines to Utf8* functions in LazUtf8 should maybe be disabled when DisableUtf8RTL is defined?

It is disabled. That is the whole point of having DisableUtf8RTL.

Bart

  • Hero Member
  • *****
  • Posts: 5612
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #42 on: March 12, 2016, 12:14:29 am »
It is disabled. That is the whole point of having DisableUtf8RTL.

That's what you get when you're too lazy to look it up :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[  :-[

Bart

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #43 on: March 13, 2016, 02:26:06 am »
Bart, I agree.

Thaddy

  • Hero Member
  • *****
  • Posts: 18306
  • Here stood a man who saw the Elbe and jumped it.
Re: BUG? Can someone explain
« Reply #44 on: March 13, 2016, 11:46:01 am »
Well, I agree, but even by my standards Bart went over the top a little in using old school expressiveness.
Therefore I would end with  :'( >:D
Due to censorship, I changed this to "Nelly the Elephant". Keeps the message clear.

 
