Author Topic: BUG? Can someone explain  (Read 20443 times)

derek.john.evans

  • Guest
Re: BUG? Can someone explain
« Reply #15 on: March 11, 2016, 02:24:03 pm »
No, this time you can blame Lazarus and its UTF-8 functions mapped to widestringmanager.
Now I understand the reason. It can be fixed by using UTF8UpperCase instead of UTF8LowerCase in UTF8CompareText. Isn't this correct? Does it cause any potential problems?


Causes major problems for code written. ie: It breaks all kinds of code.

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #16 on: March 11, 2016, 02:30:23 pm »
@Juha:

Using Utf8UpperCase in Utf8CompareText just hides the problem.
Geepster seems to expect that comparing 'apple' with '_apple' must give the same result as comparing 'APPLE' with '_APPLE' using a case-sensitive algorithm.

While this is the case on Windows with a "Western" locale, it cannot be guaranteed on all platforms and in all locales, even in a plain fpc program.

CompareStr, Utf8CompareStr and Utf8CompareText all use CompareMemRange, and this will give different results in these cases, since it will compare '_' against either 'A' or 'a'.
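
The byte-order effect described here can be reproduced with a minimal sketch, assuming plain FPC with only SysUtils (the program name is illustrative). Only the sign of each CompareStr result is meaningful; the exact magnitude may differ between versions.

```pascal
program UnderscoreOrder;
{$mode objfpc}{$H+}
uses
  SysUtils;
begin
  // In ASCII, '_' sits between the uppercase and the lowercase letters.
  WriteLn('Ord(''A'') = ', Ord('A'));   // 65
  WriteLn('Ord(''_'') = ', Ord('_'));   // 95
  WriteLn('Ord(''a'') = ', Ord('a'));   // 97

  // Byte-wise comparison: the first bytes differ, so they decide the sign.
  WriteLn(CompareStr('APPLE', '_APPLE') < 0);  // TRUE: 'A' (65) < '_' (95)
  WriteLn(CompareStr('apple', '_apple') > 0);  // TRUE: 'a' (97) > '_' (95)
end.
```

So a case-sensitive, byte-wise comparison orders '_APPLE' after 'APPLE' but '_apple' before 'apple'.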

If you want case-insensitive results, use a case-insensitive algorithm.
Even then the result may be different on different platforms, thus making sorting things (and I guess a binary tree is sorted?) complicated.

So, whichever way we go, as long as we use CompareMemRange, the result may not be what a human would expect, since we just compare byte values, and have no intrinsic knowledge about language sorting orders.

Another approach would be to query the underlying OS (like AnsiCompareStr does), but even then I think Geepster has a flaw in his reasoning, and his problem will not be resolved.
However, we can do so by using the WideCompare functions (LazUtf8 does not override these for the widestringmanager), and at least on Windows this then works as Geepster expects (but IMO still by sheer luck).
I'm unaware of how the WideCompare functions are implemented on other platforms (Linux, OSX), so I can't comment on that at all.
(I do know that we then need cwstring on *nix, or fpwidestring (slower but consistent across all platforms)).

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #17 on: March 11, 2016, 02:34:52 pm »
Causes major problems for code written. ie: It breaks all kinds of code.

I think your reasoning is wrong.
There is no reason to expect that comparing 'APPLE' with '_APPLE' should give the same result as comparing 'apple' with '_apple' when using a case-sensitive comparison algorithm.

Use a case-insensitive algorithm and (for that algorithm) you will get consistent results.

Bart

derek.john.evans

  • Guest
Re: BUG? Can someone explain
« Reply #18 on: March 11, 2016, 02:40:15 pm »
Yea, good luck with that!

Hence why I left this forum. Total idiots

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #19 on: March 11, 2016, 03:03:15 pm »
Total idiots

That's uncalled for.
I tried to explain why your reasoning is wrong in my opinion.
I never called you names, and I would prefer if you would refrain from that as well.
It is perfectly OK to disagree on this topic.

Obviously my reasoning can be wrong as well.
If you think so, then please explain to me why that is the case.
Especially why you would think that comparing 'APPLE' with '_APPLE' should give the same result as comparing 'apple' with '_apple' when using a case-sensitive comparison algorithm.
Because AFAICS that is what you expect.

I pointed out to you that the simple fact that this used to be true on Windows, does not mean that this would necessarily be the case on any platform on any locale.

I also gave some alternatives which might help in your particular case.

You never responded to these suggestions.
Don't they work for you?
If so, why not?

Neither did you respond to me questioning your reasoning.
You only keep repeating that in fact it is a bug in either fpc or Lazarus.

Bart

derek.john.evans

  • Guest
Re: BUG? Can someone explain
« Reply #20 on: March 11, 2016, 03:11:35 pm »
And, yet, the current situation breaks code.

I'm 100% sure this will be fixed. When? Hopefully soon.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #21 on: March 11, 2016, 03:21:28 pm »
Many companies, like Microsoft, never fix underlying bugs, because it would break the workarounds the developers created.

That's also why they generally don't improve things, but freeze them and build something completely new and different instead.

That way, compatibility is guaranteed.


Then again, if you do that, we, the developers, have to start from scratch on each major update.


While I would have preferred full 32-bit Unicode characters, I am very happy with the move to UTF-8. And I accept that it changes the behavior of some things, and might even break others.

Because, in the long run, it makes my work easier.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4715
  • I like bugs.
Re: BUG? Can someone explain
« Reply #22 on: March 11, 2016, 03:25:42 pm »
Using Utf8UpperCase in Utf8CompareText just hides the problem.
Geepster seems to expect that comparing 'apple' with '_apple' must give the same result as comparing 'APPLE' with '_APPLE' using a case-sensitive algorithm.

Yes but we have 2 separate issues now:

1. Dealing with '_' in a case sensitive comparison. I agree with your reasoning. It is the only way to go when the numeric values of ASCII characters are compared.
BTW, this does not depend on platform or locale. The pure 7-bit ASCII is the same everywhere. Actually any UTF-8 comparison should not depend on platform or locale. That is the idea of Unicode. (Or maybe this is another detail of Unicode that I have misunderstood).

2. Comparing the same pure 7-bit ASCII characters with the case-insensitive CompareText and Utf8CompareText. They should give the same result. If they don't, I consider it a bug which must be fixed.
If UTF8CompareText uses UTF8UpperCase, what does it break? I don't think it breaks anything, it only fixes a bug and makes things consistent.
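
Why the folding direction matters can be shown with a minimal sketch. It assumes pure ASCII input and uses the SysUtils functions UpperCase and LowerCase as stand-ins for UTF8UpperCase and UTF8LowerCase; it is not the actual UTF8CompareText implementation, and the program and constant names are illustrative.

```pascal
program FoldDirection;
{$mode objfpc}{$H+}
uses
  SysUtils;
const
  S1 = 'apple';
  S2 = '_apple';
begin
  // Fold to upper case first: compares 'A' (65) with '_' (95), result < 0.
  WriteLn(CompareStr(UpperCase(S1), UpperCase(S2)) < 0);  // TRUE
  // Fold to lower case first: compares 'a' (97) with '_' (95), result > 0.
  WriteLn(CompareStr(LowerCase(S1), LowerCase(S2)) > 0);  // TRUE
end.
```

Both variants are case-insensitive, yet they disagree on the sign whenever a letter is compared with a character that lies between 'Z' and 'a' in ASCII, such as '_'.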
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4715
  • I like bugs.
Re: BUG? Can someone explain
« Reply #23 on: March 11, 2016, 03:58:18 pm »
Maybe you can use WideCompareStr() instead? At least on Windows, it gives these results:
Code:
WideCompareStr(APPLE,_APPLE)  = 1
WideCompareStr(apple,_apple)  = 1
WideCompareText(APPLE,_APPLE) = 1
WideCompareText(apple,_apple) = 1

It calls Windows API. What is the logic there? It does not only compare numeric values of WideChars, does it?
I guess AnsiCompareStr() / AnsiCompareText() in Delphi, using UTF-16, does the same thing. It means our system is not Delphi compatible but it is technically more correct when you think of the order of ASCII characters.
IMO the only bug is the CompareText <-> Utf8CompareText thing.

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4715
  • I like bugs.
Re: BUG? Can someone explain
« Reply #24 on: March 11, 2016, 04:00:46 pm »
... About using UTF8UpperCase instead of UTF8LowerCase in UTF8CompareText ...

Causes major problems for code written. ie: It breaks all kinds of code.

How exactly?

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #25 on: March 11, 2016, 04:16:19 pm »
2. Comparing the same pure 7-bit ASCII characters with the case-insensitive CompareText and Utf8CompareText. They should give the same result. If they don't, I consider it a bug which must be fixed.
If UTF8CompareText uses UTF8UpperCase, what does it break? I don't think it breaks anything, it only fixes a bug and makes things consistent.

Note that AnsiCompareText/WideCompareText and CompareText also disagree about this (plain fpc, no LazUtf8):
Code:
AnsiCompareText(APPLE,_APPLE) = 1
WideCompareText(APPLE,_APPLE) = 1  //note: tested with and without fpwidestring
CompareText(APPLE,_APPLE)     = -30

So, which way to go then with Utf8CompareText?
Currently in this particular circumstance it behaves like AnsiCompareText and WideCompareText (the latter one being at least sort of Unicode): Result > 0.
So, why would changing this make Utf8CompareText more consistent?
Making it behave like the WideCompare functions makes more sense to me (since LCL aims to be UTF8 and therefore fully Unicode enabled w.r.t. strings).
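
For anyone who wants to repeat the comparison on their own platform, here is a minimal sketch of a plain fpc test program (no LazUtf8). The program name is illustrative, and pulling in cwstring on Unix is an assumption based on the remark earlier in this thread. The printed values depend on platform, locale and widestring manager, so no expected output is given.

```pascal
program CompareTextRepro;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // widestring manager on *nix
  SysUtils;
begin
  // Only the sign of each result matters for sorting.
  WriteLn('AnsiCompareText(APPLE,_APPLE) = ', AnsiCompareText('APPLE', '_APPLE'));
  WriteLn('WideCompareText(APPLE,_APPLE) = ', WideCompareText('APPLE', '_APPLE'));
  WriteLn('CompareText(APPLE,_APPLE)     = ', CompareText('APPLE', '_APPLE'));
end.
```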

The pure 7-bit ASCII is the same everywhere.

IIRC, in Turkish one of the lower ASCII characters has quite a different meaning than the "corresponding lowercase one" (offset by Ord('A') - Ord('a') is what I mean here). And probably a different sort order.

Bart
« Last Edit: March 11, 2016, 04:23:38 pm by Bart »

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #26 on: March 11, 2016, 04:21:59 pm »
It calls Windows API. What is the logic there? It does not only compare numeric values of WideChars, does it?
I guess AnsiCompareStr() / AnsiCompareText() in Delphi, using UTF-16, does the same thing. It means our system is not Delphi compatible but it is technically more correct when you think of the order of ASCII characters.

The OS calls take the current locale into consideration (so does fpwidestring).

IMO the only bug is the CompareText <-> Utf8CompareText thing.

I disagree, see my post above.
IMO Utf8CompareText should behave like WideCompareText (with regard to the sign of the result).

Bart

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4715
  • I like bugs.
Re: BUG? Can someone explain
« Reply #27 on: March 11, 2016, 04:34:13 pm »
IMO Utf8CompareText should behave like WideCompareText (with regard to the sign of the result).

Uhhh, this is complex.
I didn't know a locale also changes the 7-bit ASCII sorting order. How to get the logic for comparison then? I have no idea.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #28 on: March 11, 2016, 04:38:33 pm »
I don't care how they behave exactly, as long as they give the correct result for UTF8, as that is now the standard string format. In which, the underscore is in between uppercase and lowercase letters. Which requires giving a different result than with Ascii, where the underscore comes before both uppercase and lowercase letters.

It's different, get used to it. If you use a standard, you have to follow it completely. Go all the way.

And there's always the compiler option to use Ansi strings, if you don't like it.

Bart

  • Hero Member
  • *****
  • Posts: 5731
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #29 on: March 11, 2016, 04:41:12 pm »
Uhhh, this is complex.
I didn't know a locale also changes the 7-bit ASCII sorting order. How to get the logic for comparison then? I have no idea.

Mind you, I'm not absolutely sure about this.
I read it somewhere.

But you're right.
Getting sorting order right can be very difficult.
That's why it would be best (IMO) to leave that to the OS/WS if at all possible.
Chances are that their implementations will have fewer bugs than our own (even if it's M$).

Bart

 
