Recent

Author Topic: naturalstrcmp function  (Read 28967 times)

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #90 on: May 16, 2015, 10:19:09 pm »
Do we have a official source for the definition of Natural sort?

Yes, I think: the SouceForge's NaturalSortUnit. It is immediately  updated, every time we change it.

http://sourceforge.net/projects/lazarusfiles/files/naturalsort.zip/download

I think that the Windows function's bug is about differencing the numbers by its length, instead of the value, when the value is the same, which occurs only for zeroes, BTW.
« Last Edit: May 17, 2015, 12:07:55 am by typo »

rvk

  • Hero Member
  • *****
  • Posts: 3842
Re: naturalstrcmp function
« Reply #91 on: May 17, 2015, 12:10:02 am »
I think that the Windows function's bug is about differencing the numbers by its length, instead of the value, when the value is the same, which occurs only for zeroes, BTW.
No, I don't think Windows looks at the length of the string in the case the numbers are equal. It looks at the individual numbers at their position.

And of course it only happens with the zero's before numbers. Otherwise the numbers wouldn't be the same:
Code: [Select]
111
11
1
These are not the same. Only if the (same) numbers are prefixed with zero they could be the same.

I still think there is an argument for the way Windows sorts them.
Like I said:
Code: [Select]
001
01
1
^^
||-----look at the 0, 1
--------[b]look at the 0, 0, 1[/b]
And I think Microsoft sees that when these 3 numbers are in value equal (which they are) they look at the individual numbers themselves at their respective position and in that case the order would be correct.

Code: [Select]
1
01
001
This would be wrong in that case. Because the 1 is before the 0.

I haven't found any real official document describing this kind of sorting (but will continue looking). Maybe there isn't any and it's just "open" to interpretation.

I'm also still hopeful to get that naturalstrcmp going. (haven't tried passing a UCS4 to that one yet)

B.T.W. I think Linux also sorts the filenames the way the Windows-sort works:
Code: [Select]
-rw-r--r--  1 root root 0 May 17 00:17 001
-rw-r--r--  1 root root 0 May 17 00:17 01
-rw-r--r--  1 root root 0 May 17 00:17 1
-rw-r--r--  1 root root 0 May 17 00:17 11

I also have found this definition:

Quote
Natural Order uses the following rule to sort strings:

  • The space character sorts before numbers; numbers sort before non-numbers.
  • Numbers sort in numerical order; leading zeroes and spaces on numbers are ignored, except as a tie-breaker for numbers that have the same numerical value.
  • Non-numbers sort in the Mac's System sorting order.
So in case of a tie braker (numbers with same numerical value) you need to look at the leading zeros and spaces. And in that case a zero becomes before the 1.

So... maybe this isn't a bug at all.
« Last Edit: May 17, 2015, 12:26:03 am by rvk »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #92 on: May 17, 2015, 12:28:32 am »
Our function avoids this issue this way:

Code: [Select]
        Num1 := GetNumber(pStr1, Len1);
        Num2 := GetNumber(pStr2, Len2);
        if Num1 < Num2 then
          Result := -1
        else if Num1 > Num2 then
          Result := 1
        else
        begin
          if Len1 < Len2 then
            Result := -1
          else if Len1 > Len2 then
            Result := 1;
        end;           

rvk

  • Hero Member
  • *****
  • Posts: 3842
Re: naturalstrcmp function
« Reply #93 on: May 17, 2015, 12:34:40 am »
Our function avoids this issue this way:
Well... it's not actually "avoiding" the issue.

If, looking at the arguments in my previous post (I added some extra arguments), the ordering in Windows is considered correct (because of the definition) then the function should be:
Code: [Select]
          if Len1 < Len2 then
            Result := 1
          else if Len1 > Len2 then
            Result := -1;

So it all comes down to the fact... what is correct.

Linux ls option -v also orders natural like this:
Code: [Select]
# ls -lv
total 0
-rw-r--r--  1 root root 0 May 17 00:17 001
-rw-r--r--  1 root root 0 May 17 00:17 01
-rw-r--r--  1 root root 0 May 17 00:17 1
-rw-r--r--  1 root root 0 May 17 00:17 2
-rw-r--r--  1 root root 0 May 17 00:17 11

And sort -n also does this:
Code: [Select]
# ls | sort -n
001
01
1
2
11
« Last Edit: May 17, 2015, 12:49:33 am by rvk »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #94 on: May 17, 2015, 12:53:52 am »
Hard question.

In my opinion, natural sort refers to human sort, grammatically and alphabetically correct.

If you ask a computer user to sort a list of words, he/she will sort:

01
001
0001

and not

0001
001
01

It is intuitive. Language dictionaries would do so the same.

Perhaps it is not a bug, since it can be by design, but is not the human way of sorting strings.

If it is necessary to do a choice, I choose the human, intuitive way, since computers did not invent sorting strings task.
« Last Edit: May 17, 2015, 01:14:15 am by typo »

rvk

  • Hero Member
  • *****
  • Posts: 3842
Re: naturalstrcmp function
« Reply #95 on: May 17, 2015, 01:12:47 am »
If you ask a computer user to sort a list of words, he/she will sort:
01
001
0001
Yes. That's also true. Instinctively I would also choose this ordering.

I guess Linux/Windows native ordering differs on that point (as not being intuitive).

One possibility could be a global switch to use the same ordering in the function as is done with the Windows-api (and Linux-commands) but a clear description on the difference and why there is a difference would also be sufficient. Maybe calling it a "bug" would be a bit much (because they follow another set of rules. Calling it "ordering not according to dictionary-rules and non-intuitive" would be more diplomatic :).

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #96 on: May 17, 2015, 01:15:53 am »
I changed the comment of the function to this:

Code: [Select]
{

NaturalSort sorts UTF-8 strings in a natural collated order, so numbers are sorted too.
 By design it sorts like this:

 01
 001
 0001

 and

 0
 00
 000
 000_A
 000_B

 in a human, intuitive order.

 }       
« Last Edit: May 17, 2015, 01:17:48 am by typo »

rvk

  • Hero Member
  • *****
  • Posts: 3842
Re: naturalstrcmp function
« Reply #97 on: May 17, 2015, 01:17:40 am »
I change the comment of the function to this:
Perfect :)

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #98 on: May 17, 2015, 02:06:05 am »
Code: [Select]
{$IFDEF HUMAN_SORT}
  {Uncomment Human_Sort define just at the beggining of implementation section
   to use the human, intuitive sort way. Otherwise use Windows API function.
   Who knows how to sort a list of strings, any alphabetically competent person
   or programmers? The choice is yours.}
    Result := NaturalStringCompare(aList[Index1], aList[Index2]);
  {$ELSE}
    {$IFDEF WINDOWS}
      Result := StrCmpLogicalW(PWideChar(UTF8Decode(aList[Index1])), PWideChar(UTF8Decode(aList[Index2])));
    {$ELSE}
      Result := NaturalStringCompare(aList[Index1], aList[Index2]);
    {$ENDIF}
  {$ENDIF}

Also, I think it is a good idea to use the NORM_LINGUISTIC_CASING flag in CompareStringW:

Code: [Select]
Result := CompareStringW(LOCALE_USER_DEFAULT, NORM_LINGUISTIC_CASING,
            pWideChar(UTF8Decode(TextStr1)), Length(TextStr1),
            pWideChar(UTF8Decode(TextStr2)), Length(TextStr2)) - 2; 
« Last Edit: May 17, 2015, 11:13:08 am by typo »

rvk

  • Hero Member
  • *****
  • Posts: 3842
Re: naturalstrcmp function
« Reply #99 on: May 17, 2015, 11:41:52 am »
Now that the switch for HUMAN_SORT is build in and it is default (as it should) Windows uses this function correctly. However, if someone wants the Windows/Linux directory ordering, the HUMAN_SORT-define can be commented out but then it only works for Windows but not for Linux.

My suggestion is to do this:
Code: [Select]
          {$IFDEF HUMAN_SORT}
          if Len1 < Len2 then
            Result := -1
          else if Len1 > Len2 then
            Result := 1;
          {$ELSE}
          if Len1 < Len2 then
            Result := 1
          else if Len1 > Len2 then
            Result := -1;
          {$ENDIF}

In that case, when you comment out HUMAN_SORT on Linux it works the exact same way as on Windows (and as sort -n and ls -lv on Linux).

NORM_LINGUISTIC_CASING might be a good option. Otherwise, for Turkish and Azeri, the i is mapped to I.

I did notice something else in my testing. StrCmpLogicalW places the / before the numbers. But CompareStringW places it between the numbers and letters. Another weird thing about StrCmpLogicalW?

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #100 on: May 17, 2015, 01:35:01 pm »
My suggestion is to do this:

Done.

Another weird thing about StrCmpLogicalW?

No, AFAIK.

I thought StrCmpLogicalW used CompareStringW internally for alpha strings as we do, but it seems that it is not the case.

Linux manual page says nothing about casing in wcscoll:

http://linux.die.net/man/3/wcscoll

I assume it is locale-aware casing. So NORM_LINGUISTIC_CASING flag in CompareStringW is coherent to this.

Attached the test project I am using, maybe it can be useful.
« Last Edit: May 17, 2015, 06:06:49 pm by typo »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: naturalstrcmp function
« Reply #101 on: May 18, 2015, 11:19:26 am »
I have searched about UTF8ToUCS4String routine, but it seems that the perfomance increase is not significant.

I even found this routine which does the same conversion:

Code: [Select]
function UTF8ToUCS4String(const str: UTF8String): UCS4String;
begin
  Result := WideStringToUCS4String(UTF8Decode(str));
end;

http://bugs.freepascal.org/view.php?id=28126
« Last Edit: May 19, 2015, 04:07:50 pm by typo »