•    Free Pascal
• Website
• Downloads
• Wiki
• Bugtracker
• Mailing List
•    Lazarus
• Website
• Downloads (Laz+FPC)
• Packages (OPM)
• FAQ
• Wiki
• Bugtracker
• IRC channel
• Developer Blog
• Follow us on Twitter
• Latest SVN
• Mailing List
• Other languages
•    Foundation
• Website
• Project Roadmap
• Getting the Source
• Screenshots

### Bookstore Computer Math and Games in Pascal (preview) Lazarus Handbook (preview only)

### Author Topic: naturalstrcmp function  (Read 28973 times)

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #30 on: May 12, 2015, 09:09:07 pm »
Code: [Select]
`1b2a3 a20 aaãáàâäbcçdpic1pic1a2pic1a3pic2pic3pic10pic20`
But it needs to work in any language, according to the system language.

I need a function that works in any language, but could not test it there.

UTF8CompareStrCollated seems to work for me.

This is the function I am testing:

Code: [Select]
`Function NaturalCompareStr(Str1,Str2: string):Integer;var Num1,Num2:Double;    pStr1,pStr2:PChar;    Len1,Len2:Integer;    utf8char1 :String;    utf8char2 :String;    charlen1 :Integer;    charlen2 :Integer;  Function IsNumber(ch:Char):Boolean;  begin     Result:=ch in ['0'..'9'];  end;  Function GetNumber(var pch:PChar;var Len:Integer):Double;    var FoundPeriod:Boolean;        Count:Integer;  begin     FoundPeriod:=False;     Result:=0;     While (pch^<>#0) and        (IsNumber(pch^) or ((not FoundPeriod) and (pch^='.'))) do     begin        if pch^='.' then        begin          FoundPeriod:=True;          Count:=0;        end        else        begin           if FoundPeriod then           begin             Inc(Count);             Result:=Result+(ord(pch^)-ord('0'))*Power(10,-Count);           end           else Result:=Result*10+ord(pch^)-ord('0');        end;        inc(Len);        Inc(pch);     end;  end;begin  if (Str1<>'') and (Str2<>'') then  begin    pStr1:=@Str1; pStr2:=@Str2;    Result:=0;    While not ((pStr1^=#0) or (pStr2^=#0)) do    begin       Len1:=0; Len2:=0;       while (pStr1^=' ') do begin Inc(pStr1); Inc(Len1) end;       while (pStr2^=' ') do begin Inc(pStr2); Inc(Len2) end;       if IsNumber(pStr1^) and IsNumber(pStr2^) then       begin          Num1 := GetNumber(pStr1, Len1);          Num2 := GetNumber(pStr2, Len2);          if Num1 < Num2 then Result := -1          else if Num1 > Num2 then Result := 1          else begin             if Len1 < Len2 then Result := -1             else if Len1 > Len2 then Result := 1;          end;          Dec(pStr1);          Dec(pStr2);       end       else       begin         Charlen1 := UTF8CharacterLength(pStr1);         Charlen2 := UTF8CharacterLength(pStr2);         UTF8Char1 := Copy(pstr1, 1, charlen1);         UTF8Char2 := Copy(pstr2, 1, charlen2);         Result := UTF8CompareStrCollated(UTF8Char1, UTF8Char2);       end;       if Result<>0 then Break;       Inc(pStr1); Inc(pStr2);    end;  end;  Num1 := length(Str1);  Num2 := length(Str2);  if (Result = 0) and (Num1 <> Num2) then  begin       if Num1<Num2 then Result:=-1 else Result:=1;  end;end;              `
This function also avoids StrCmpLogicalW bug with zeroes. But it is slow.
« Last Edit: May 12, 2015, 09:43:48 pm by typo »

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #31 on: May 13, 2015, 12:03:10 am »
At least for Linux this seems to be the best choice.

#### rvk ##### Re: naturalstrcmp function
« Reply #32 on: May 13, 2015, 09:27:12 am »
What Locale are you using in Debian?
With pt_PT.UTF-8 I got this:
Code: [Select]
`1b2a3 a20 aaáàâäãbcçdpic1pic1a2pic1a3pic2pic3pic10pic20`(slightly different order than you gave as desired ordering)

Or is everything (in Windows and Debian) the way you want now?

If speed is still an issue it should be checked if the code should be optimized or if the performance problem lies in UTF8CompareStrCollated. Although I don't think UTF8CompareStrCollated itself can be optimized that much because it just calls the appropriate api. You could of course look at some other/better sorting method.

(I always like these videos for sorting inspiration )

Edit:
I'm looking at your function and it could be optimized (a lot). As far as I can see you compare the alphabetical part letter for letter. That means for every letter the UTF8CompareStrCollated is called (which is very, very time consuming). You need to limit the times UTF8CompareStrCollated is called. So you need to gather all the letters which are packed together and call UTF8CompareStrCollated just once to compare. (That way the api-call speed-performance is decreased by a factor of the number of alphabetical characters in the string. That's why I initially suggested splitting the string into parts and comparing the parts afterwards (or at least change the UTF8CharacterLength-lines to scan for more characters to compare).

« Last Edit: May 13, 2015, 09:38:45 am by rvk »

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #33 on: May 13, 2015, 02:31:18 pm »
For now it is so:

Code: [Select]
`function NaturalCompareStr(Str1,Str2: string):Integer;var Num1,Num2:Double;    pStr1,pStr2:PChar;    Len1,Len2:Integer;    TextLen1, TextLen2 :integer;    TextStr1 :string = '';    TextStr2 :string = '';    i :Integer;    j :Integer;  Function IsNumber(ch:Char):Boolean;  begin     Result:=ch in ['0'..'9'];  end;  Function GetNumber(var pch:PChar;var Len:Integer):Double;    var FoundPeriod:Boolean;        Count:Integer;  begin     FoundPeriod:=False;     Result:=0;     While (pch^<>#0) and        (IsNumber(pch^) or ((not FoundPeriod) and (pch^='.'))) do     begin        if pch^='.' then        begin          FoundPeriod:=True;          Count:=0;        end        else        begin           if FoundPeriod then           begin             Inc(Count);             Result:=Result+(ord(pch^)-ord('0'))*Power(10,-Count);           end           else Result:=Result*10+ord(pch^)-ord('0');        end;        inc(Len);        Inc(pch);     end;  end;  procedure GetChars;  begin    TextLen1 := 0;    while not((pStr1 + TextLen1)^ in ['0'..'9']) and ((pStr1 + TextLen1)^ <> #0) do      Inc(TextLen1);    SetLength(TextStr1, TextLen1);    i := 1; j := 0;    while i <= TextLen1 do    begin      TextStr1[i] := (pStr1 + j)^;      Inc(i); Inc(j);    end;    TextLen2 := 0;    while not((pStr2 + TextLen2)^ in ['0'..'9']) and ((pStr2 + TextLen2)^ <> #0) do      Inc(TextLen2);    SetLength(TextStr2, TextLen2);    i := 1; j := 0;    while i <= TextLen2 do    begin      TextStr2[i] := (pStr2 + j)^;      Inc(i); Inc(j);    end;  end;begin  if (Str1<>'') and (Str2<>'') then  begin    pStr1:=@Str1; pStr2:=@Str2;    Result:=0;    While not ((pStr1^=#0) or (pStr2^=#0)) do    begin      TextLen1 := 1;      TextLen2 := 1;      Len1:=0; Len2:=0;      while (pStr1^=' ') do begin Inc(pStr1); Inc(Len1) end;      while (pStr2^=' ') do begin Inc(pStr2); Inc(Len2) end;      if IsNumber(pStr1^) and IsNumber(pStr2^) then      begin        Num1 := GetNumber(pStr1, Len1);        Num2 := GetNumber(pStr2, Len2);        if Num1 < Num2 then Result := -1        else if Num1 > Num2 then Result := 1        else begin           if Len1 < Len2 then Result := -1           else if Len1 > Len2 then Result := 1;        end;        Dec(pStr1);        Dec(pStr2);      end      else      begin        GetChars;        Result := WideCompareText(UTF8Decode(TextStr1),UTF8Decode(TextStr2));      end;      if Result <> 0 then Break;      Inc(pStr1, TextLen1);      Inc(pStr2, TextLen2);    end;  end;  Num1 := Length(Str1);  Num2 := Length(Str2);  if (Result = 0) and (Num1 <> Num2) then  begin       if Num1 < Num2 then         Result := -1       else         Result := 1;  end;end;        `
Still slow. 3 times slower than Windows one.
« Last Edit: May 13, 2015, 02:45:03 pm by typo »

#### rvk ##### Re: naturalstrcmp function
« Reply #34 on: May 13, 2015, 04:47:10 pm »
Still slow. 3 times slower than Windows one.
I just tested the difference between NaturalCompareStr, UTF8CompareStrCollated and StrCmpLogicalW. It came to this:

20000000 iterations:
NaturalCompareStr: 26,22
UTF8CompareStrCollated: 5,94 (just string comparison)
StrCmpLogicalW: 4,57

So the difference between UTF8CompareStrCollated and StrCmpLogicalW is not that big. I did notice a steep time-increase when you make the strings bigger.

I did it like this, initially:
Code: [Select]
`      begin        Charlen1 := 0;        Charlen2 := 0;        qStr1 := pStr1;        qStr2 := pStr2;        repeat          Charlen1 := Charlen1 + UTF8CharacterLength(qStr1);          Charlen2 := Charlen2 + UTF8CharacterLength(qStr2);          Inc(qStr1);          Inc(qStr2);        until (qStr1^ = #0) or (qStr2^ = #0) or IsNumber(qStr1^) or IsNumber(qStr2^);        UTF8Char1 := Copy(pstr1, 1, charlen1);        UTF8Char2 := Copy(pstr2, 1, charlen2);        Result := UTF8CompareStrCollated(UTF8Char1, UTF8Char2);        pStr1 := qStr1 - 1;        pStr2 := qStr2 - 1;        // Inc(NumCalled);      end;`
And that came at 16 seconds against your 26 seconds (for 20000000 iterations). That's probably due to the fact you use SetLength for TextStr1, etc... in GetChar.

My method was just a quick and dirty scan to the next non alpha-character and taking that count to fill the UTF8Chars. But it can be even more optimized than it is now. There are also a lot of calls to internal functions (like IsNumber, GetNumber). If these are eliminated it would make some difference. You could start by making isNumber an inline-function.
Code: [Select]
`  function IsNumber(ch: char): boolean; inline;`(just 1 seconds from 16 but every little bit helps )

Another optimization... don't scan for characters when you already encountered a different character (pStr1^<>pStr2) because after that the strings are always different and we need not check any further. i.e. checking for the difference between 'aaxacccbbb1121' and 'aayacccbbb2212'. would only mean checking 'aax' and 'aay'. If you jump out, it will also save some 4 seconds of my test (12 against 16) and in that case it's getting closer to the 4,5 and 6 seconds of Windows-api Code: [Select]
`        repeat          Charlen1 := Charlen1 + UTF8CharacterLength(qStr1);          Charlen2 := Charlen2 + UTF8CharacterLength(qStr2);          if qStr1^  <> qStr2^ then break; // no need to check further          Inc(qStr1);          Inc(qStr2);        until (qStr1^ = #0) or (qStr2^ = #0) or IsNumber(qStr1^) or IsNumber(qStr2^);`
(and maybe the GetNumber could also still be optimized, but I need to look into that)

How do you test the speeds? Do you have a small test program or do you use a real dataset to sort?

(not checked for errors)
Code: [Select]
`function NaturalCompareStr(Str1, Str2: string): integer;var  Num1, Num2: double;  pStr1, pStr2: PChar;  qStr1, qStr2: PChar;  Len1, Len2: integer;  utf8char1: string;  utf8char2: string;  charlen1: integer;  charlen2: integer;  function IsNumber(ch: char): boolean; inline;  begin    Result := ch in ['0'..'9'];  end;  function GetNumber(var pch: PChar; var Len: integer): double;  var    FoundPeriod: boolean;    Count: integer;  begin    FoundPeriod := False;    Result := 0;    while (pch^ <> #0) and      (IsNumber(pch^) or ((not FoundPeriod) and (pch^ = '.'))) do    begin      if pch^ = '.' then      begin        FoundPeriod := True;        Count := 0;      end      else      begin        if FoundPeriod then        begin          Inc(Count);          Result := Result + (Ord(pch^) - Ord('0')) * Power(10, -Count);        end        else          Result := Result * 10 + Ord(pch^) - Ord('0');      end;      Inc(Len);      Inc(pch);    end;  end;begin  if (Str1 <> '') and (Str2 <> '') then  begin    pStr1 := @Str1;    pStr2 := @Str2;    Result := 0;    while not ((pStr1^ = #0) or (pStr2^ = #0)) do    begin      Len1 := 0;      Len2 := 0;      while (pStr1^ = ' ') do      begin        Inc(pStr1);        Inc(Len1);      end;      while (pStr2^ = ' ') do      begin        Inc(pStr2);        Inc(Len2);      end;      if IsNumber(pStr1^) and IsNumber(pStr2^) then      begin        Num1 := GetNumber(pStr1, Len1);        Num2 := GetNumber(pStr2, Len2);        if Num1 < Num2 then          Result := -1        else if Num1 > Num2 then          Result := 1        else        begin          if Len1 < Len2 then            Result := -1          else if Len1 > Len2 then            Result := 1;        end;        Dec(pStr1);        Dec(pStr2);      end      else      begin        Charlen1 := 0;        Charlen2 := 0;        qStr1 := pStr1;        qStr2 := pStr2;        repeat          Charlen1 := Charlen1 + UTF8CharacterLength(qStr1);          Charlen2 := Charlen2 + UTF8CharacterLength(qStr2);          if qStr1^  <> qStr2^ then break; // no need to check further          Inc(qStr1);          Inc(qStr2);        until (qStr1^ = #0) or (qStr2^ = #0) or IsNumber(qStr1^) or IsNumber(qStr2^);        UTF8Char1 := Copy(pstr1, 1, charlen1);        UTF8Char2 := Copy(pstr2, 1, charlen2);        Result := UTF8CompareStrCollated(UTF8Char1, UTF8Char2);        pStr1 := qStr1 - 1;        pStr2 := qStr2 - 1;        // Inc(NumCalled);      end;      if Result <> 0 then        Break;      Inc(pStr1);      Inc(pStr2);    end;  end;  Num1 := length(Str1);  Num2 := length(Str2);  if (Result = 0) and (Num1 <> Num2) then  begin    if Num1 < Num2 then      Result := -1    else      Result := 1;  end;end;`

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #35 on: May 13, 2015, 05:38:09 pm »
I use a real dataset to sort. A memo is populated by random text+numbers lines. And I measure NaturalCompareStr against StrCmpLogicalW and not StrCmpLogicalW against UTF8CompareStrCollated.

#### rvk ##### Re: naturalstrcmp function
« Reply #36 on: May 13, 2015, 05:40:19 pm »
And I measure NaturalCompareStr against StrCmpLogicalW and not StrCmpLogicalW against UTF8CompareStrCollated.
Yes, but I only included the UTF8CompareStrCollated to see if the raw-calling of UTF8CompareStrCollated is fast enough.
Even if I exclude everything in the function (just compare raw strings) you can't come close to the speed of StrCmpLogicalW (due to Copy()-function etc).

But... hey... the only reason you didn't want to use StrCmpLogicalW on Windows was that it didn't take locale into account, wasn't it? (and a small bug?)

I just found out CompareString (beginning from Windows 7) has a parameter SORT_DIGITSASNUMBERS.
And CompareString is locale-aware.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd317759(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/dd318144%28v=vs.85%29.aspx#sort_digits_as_numbers

Is that no option for Windows? (in that case you have the speed back for Windows and you can concentrate on Linux)

(or did you already found that one and dismissed it?)

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #37 on: May 13, 2015, 06:02:27 pm »
But... hey... the only reason you didn't want to use StrCmpLogicalW on Windows was that it didn't take locale into account, wasn't it? (and a small bug?)

No, it takes locale into account. I want a function to Linux because the ones I could find did not works as expected.

Code: [Select]
`begin  {\$IFDEF WINDOWS}    Result := StrCmpLogicalW(PWideChar(UTF8Decode(aList[Index1])), PWideChar(UTF8Decode(aList[Index2])));  {\$ELSE}    Result := NaturalCompareStr(aList[Index1], aList[Index2]);  {\$ENDIF}end;    `
« Last Edit: May 13, 2015, 06:08:41 pm by typo »

#### rvk ##### Re: naturalstrcmp function
« Reply #38 on: May 13, 2015, 06:09:48 pm »
But... hey... the only reason you didn't want to use StrCmpLogicalW on Windows was that it didn't take locale into account, wasn't it? (and a small bug?)

No, it takes locale into account. I want a function to Linux because the ones I could find did not works as expected.
Ah, Ok.
That's weird, though. StrCmpLogicalW and CompareStr both sort the special characters in a different way (than UTF8CompareStrCollated):
Code: [Select]
`1b2a3 a20 aaàáâãäçbcdpic1pic1a2pic1a3pic2pic3pic10pic20`
That's different from your desired (and like UTF8CompareStrCollated does):
Code: [Select]
`1b2a3 a20 aaáàâäãbcçdpic1pic1a2pic1a3pic2pic3pic10pic20`

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #39 on: May 13, 2015, 06:12:57 pm »
I don't use UTF8CompareStrCollated for Windows.

#### rvk ##### Re: naturalstrcmp function
« Reply #40 on: May 13, 2015, 06:19:44 pm »
I don't use UTF8CompareStrCollated for Windows.
In that case, how do you get your desired order?

WideCompareText gives a slightly different order than UTF8CompareStrCollated.

So... don't you mind the difference in ordering, or what do you consider the right one?

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #41 on: May 13, 2015, 06:24:12 pm »
UTF8CompareStrCollated uses WideCompareStr in Linux. The difference between it and WideCompareText is only about case sensibility.

Both Windows function and my function get my desired order.

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #42 on: May 13, 2015, 06:27:10 pm »
So I compare them to know about Windows X Linux performance.

See the attached image.

Of course I measure them both on Windows, but the code for the Linux function (NaturalCompareStr ) is the same as in a Linux system.

I test on Linux only for accuracy, not performance.
« Last Edit: May 13, 2015, 06:50:34 pm by typo »

#### rvk ##### Re: naturalstrcmp function
« Reply #43 on: May 13, 2015, 07:42:04 pm »
UTF8CompareStrCollated uses WideCompareStr in Linux. The difference between it and WideCompareText is only about case sensibility.

Both Windows function and my function get my desired order.
Ah, yes... I made a mistake with my test-program (calling in calling StrCmpLogicalW with wrong encoded parameters ).

Now for speed. I think it will be very difficult to approach the speed of StrCmpLogicalW.

One of the major speedbump is the Copy() (and or SetLength, etc) function. You could (besides the other 2 speed improvements given in my previous post) leave the copy out and just compare the strings in the original buffer pStr^. With Windows/CompareStringW you need to give the string-length so that's not a problem. With Linux/wcscoll it needs #0 at the end so you'll need to save the last character, set #0 and swap it back afterwards.

Next problem is the WideCompareText. It is SLOW. (SLOW SLOW SLOW !!!) You're better off directly calling CompareStringW directly (or wcscoll in case of Linux). I'm not yet sure about the UTF8Decode-calls but with calling CompareStringW (or wcscoll) it already makes a huge difference.

Maybe you could check if you see the same speed-improvement with my version: (not tested for stability):
Code: [Select]
`function NaturalCompareStr(Str1, Str2: string): integer;var  Num1, Num2: double;  pStr1, pStr2: PChar;  qStr1, qStr2: PChar;  Len1, Len2: integer;  char1, char2: char;  function IsNumber(ch: char): boolean; inline;  begin    Result := ch in ['0'..'9'];  end;  function GetNumber(var pch: PChar; var Len: integer): double;  var    FoundPeriod: boolean;    Count: integer;  begin    FoundPeriod := False;    Result := 0;    while (pch^ <> #0) and (IsNumber(pch^) or ((not FoundPeriod) and        (pch^ = '.'))) do    begin      if pch^ = '.' then      begin        FoundPeriod := True;        Count := 0;      end      else      begin        if FoundPeriod then        begin          Inc(Count);          Result := Result + (Ord(pch^) - Ord('0')) * Power(10, -Count);        end        else          Result := Result * 10 + Ord(pch^) - Ord('0');      end;      Inc(Len);      Inc(pch);    end;  end;begin  if (Str1 <> '') and (Str2 <> '') then  begin    pStr1 := @Str1;    pStr2 := @Str2;    Result := 0;    while not ((pStr1^ = #0) or (pStr2^ = #0)) do    begin      Len1 := 0;      Len2 := 0;      while (pStr1^ = ' ') do      begin        Inc(pStr1);        Inc(Len1);      end;      while (pStr2^ = ' ') do      begin        Inc(pStr2);        Inc(Len2);      end;      if IsNumber(pStr1^) and IsNumber(pStr2^) then      begin        Num1 := GetNumber(pStr1, Len1);        Num2 := GetNumber(pStr2, Len2);        if Num1 < Num2 then          Result := -1        else if Num1 > Num2 then          Result := 1        else        begin          if Len1 < Len2 then            Result := -1          else if Len1 > Len2 then            Result := 1;        end;        Dec(pStr1);        Dec(pStr2);      end      else      begin        qStr1 := pStr1;        qStr2 := pStr2;        repeat          if qStr1^ <> qStr2^ then break; // no need to check further          Inc(qStr1);          Inc(qStr2);        until (qStr1^ = #0) or (qStr2^ = #0) or IsNumber(qStr1^) or IsNumber(qStr2^);        Char1 := qstr1^;        Char2 := qstr2^;        qStr1 := #0;        qStr2 := #0;        // SLOW SLOW SLOW SLOW SLOW SLOW SLOW        // Result := WideCompareText(UTF8Decode(pStr1), UTF8Decode(pStr2));        // This needs to be optimized even further        {\$IFDEF WINDOWS}        Result := CompareStringW(LOCALE_USER_DEFAULT, 0,          pWideChar(UTF8Decode(pStr1)), Length(pStr1),          pWideChar(UTF8Decode(pStr2)), Length(pStr2)) - 2;        {\$ELSE}        // not yet implemented        {\$ENDIF}        if Result <> 0 then // no need to set char back if result <> 0        begin          qStr1^ := Char1;          qStr2^ := Char2;          pStr1 := qStr1 - 1;          pStr2 := qStr2 - 1;        end;      end;      if Result <> 0 then Break;      Inc(pStr1);      Inc(pStr2);    end;  end;  Num1 := Length(Str1);  Num2 := Length(Str2);  if (Result = 0) and (Num1 <> Num2) then  begin    if Num1 < Num2 then      Result := -1    else      Result := 1;  end;end;`

#### typo

• Hero Member
•     • Posts: 3051 ##### Re: naturalstrcmp function
« Reply #44 on: May 13, 2015, 09:23:33 pm »
Your version is a bit slower than the mine.