
Author Topic: BUG? Can someone explain  (Read 19737 times)

Bart

  • Hero Member
  • *****
  • Posts: 5611
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #45 on: March 13, 2016, 11:50:42 am »
Bart, I agree.

With what?
You agree I should feel embarrassed?
You agree UtfCompare* should move towards WideCompare* w.r.t. the sign of the result?

I just did some testing.
Converting to WideString and then doing WideCompare* is 35-40 times slower in the case of *CompareStr and approx. 2 times slower in the case of *CompareText.
The bottleneck here is the conversion to WideString (where Utf8ToUtf16 seems to be 2 times faster than Utf8Decode, b.t.w.).

(Notice that by itself WideCompareStr (called with 2 "static" widestrings) is already 4.5 times slower than Utf8CompareStr.)

People will start complaining about speed.

Testing 1 million compares
Utf8CompareStr: 140 ticks (GetTickCount64)
Convert to WideString then WideCompareStr: 5522 ticks
Ratio: 39.4
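
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a timing helper. It is not the code used for the numbers above: CompareViaWide and TimeCompares are hypothetical names, CompareStr is only a bytewise stand-in for the old Utf8CompareStr, and UTF8Decode is used instead of Utf8ToUtf16 so that nothing beyond SysUtils is needed.

```pascal
{$mode objfpc}{$H+}
uses
  SysUtils;

{ Hypothetical helper (not part of LazUTF8): decode both UTF-8 strings
  to UTF-16 on every call, then let WideCompareStr decide. }
function CompareViaWide(const S1, S2: AnsiString): PtrInt;
begin
  Result := WideCompareStr(UTF8Decode(S1), UTF8Decode(S2));
end;

{ Hypothetical helper: ticks (GetTickCount64) needed for Loops
  comparisons of S1 and S2, either bytewise or via WideString. }
function TimeCompares(const S1, S2: AnsiString; Loops: Integer;
  ViaWide: Boolean): QWord;
var
  T0: QWord;
  i: Integer;
begin
  T0 := GetTickCount64;
  for i := 1 to Loops do
    if ViaWide then
      CompareViaWide(S1, S2)
    else
      // Bytewise compare, standing in for the old Utf8CompareStr
      CompareStr(S1, S2);
  Result := GetTickCount64 - T0;
end;
```

Calling TimeCompares(A, B, 1000000, False) and TimeCompares(A, B, 1000000, True) with the same pair of strings and dividing the two tick counts gives a ratio of the same kind as the one above. The absolute numbers will depend on machine, platform and string length.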

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5611
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #46 on: March 13, 2016, 11:53:12 am »
... but even to my standards Bart went over the top a little in using old school expressiveness.

I'm not sure what you mean here.
Probably because English is not my native language.

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5611
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #47 on: March 14, 2016, 03:42:50 pm »
Here's another approach.
  • Scan the bytes until there is a difference
  • When you hit a difference then find the starting position of the Utf8 codepoints
  • Convert both single codepoints (the ones that differ) to WideString
  • Call WideCompareStr for those 2 WideStrings
  • If comparison stops in an invalid Utf8 codepoint, do simple byte-compare and return either -2 or +2 as a result.

Code: [Select]
{ Compares UTF8 encoded strings
  Returns
     0: if S1 = S2
    -1: if S1 < S2 ("alphabetically")
    +1: if S1 > S2
    -2: if S1 < S2, ("bytewise") comparison exited at invalid UTF8 codepoint in either S1 or S2
    +2: if S1 > S2, ("bytewise") comparison exited at invalid UTF8 codepoint in either S1 or S2
}
function Utf8CompareStr(S1: PChar; Count1: SizeInt; S2: PChar; Count2: SizeInt
  ): PtrInt;
var
  Count: SizeInt;
  i, Idx1, Idx2, CL1, CL2: Integer;
  B1, B2: Byte;
  W1, W2: WideString;
begin
  Result := 0;
  if (Count1 > Count2) then
    Count := Count2
  else
    Count := Count1;

  i := 0;
  if (Count > 0) then
  begin
    while (i < Count) do
    begin
      B1 := byte(S1^);
      B2 := byte(S2^);
      if (B1 <> B2) then
      begin
        //writeln('UCS: B1=',IntToHex(B1,2),', B2=',IntToHex(B2,2));
        Break;
      end;
      Inc(S1); Inc(S2); Inc(I);
    end;
  end;
  if (i < Count) then
  begin
    //Fallback result
    Result := B1 - B2;
    if (Result < 0) then
      Result := -2
    else
      Result := 2;
    //writeln('UCS: FallBack Result = ',Result);
    //Try to find start of valid UTF8 codepoints, max 4 bytes long
    Idx1 := i;
    Idx2 := i;
    if (B1 > 127) then
    begin
      CL1 := 1;
      //writeln('UCS: CL1=',CL1,' i=',i,' S1=',StrToHex(S1));
      while (CL1 <= 1) and (Idx1 > 0) and (i - Idx1 < 3) do
      begin
        Dec(S1);
        Dec(Idx1);
        B1 := Byte(S1^);
        //if we find ASCII here, then B1 is part of an invalid codepoint
        if (Byte(S1^) < 128) then
          CL1 := 1
        else
          CL1 := Utf8CharacterStrictLength(S1);
        //writeln('UCS: B1=',IntTohex(B1,2),' Idx1=',Idx1,' CL1=',CL1,' S1=',StrToHex(S1));
      end;
      if (CL1 = 1) then CL1 := 0;
    end
    else
      CL1 := 1; //plain ASCII
    if (CL1 = 0) then
      //Invalid Utf8 codepoint
      Exit;

    if (B2 > 127) then
    begin
      CL2 := 1;
      while (CL2 <= 1) and (Idx2 > 0) and (i - Idx2 < 3) do
      begin
        Dec(S2);
        Dec(Idx2);
        //if we find ASCII here, then B2 is part of an invalid codepoint
        if (Byte(S2^) < 128) then
          CL2 := 1
        else
          CL2 := Utf8CharacterStrictLength(S2);
      end;
      if (CL2 = 1) then CL2 := 0;
    end
    else
      CL2 := 1; //plain ASCII

    if (CL2 = 0) then
      //Invalid Utf8 codepoint
      Exit;


    //writeln('UCS: CL1=',CL1,', CL2=',CL2);
    //writeln('S1 = "',S1,'"');
    //writeln('S2 = "',S2,'"');
    W1 := Utf8ToUtf16(S1, CL1);
    W2 := Utf8ToUtf16(S2, CL2);
    //writeln('UCS: W1 = ',Word(W1[1]),' W2 = ',Word(W2[1]));
    Result := WideCompareStr(W1, W2);
  end
  else
    //Strings are the same up to the length of the shorter one
    Result := Count1 - Count2;
  if (Result > 1) then
    Result := 1
  else if (Result < -1) then
    Result := -1;
end;
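
The "find the start of the codepoint" step can also be looked at in isolation. Below is a minimal sketch, not the code from the function above: FindLeadByte is a hypothetical helper that only steps back over continuation bytes (bit pattern 10xxxxxx) and does not validate the sequence the way Utf8CharacterStrictLength does.

```pascal
{$mode objfpc}{$H+}

{ Hypothetical helper: returns the 0-based index of the byte that starts
  the codepoint containing P[Idx], looking back at most 3 bytes
  (a UTF-8 codepoint is at most 4 bytes long).
  Returns -1 if only continuation bytes were found. }
function FindLeadByte(P: PChar; Idx: SizeInt): SizeInt;
var
  Steps: Integer;
begin
  Steps := 0;
  // Continuation bytes have the form 10xxxxxx
  while (Idx > 0) and (Steps < 3) and ((Byte(P[Idx]) and $C0) = $80) do
  begin
    Dec(Idx);
    Inc(Steps);
  end;
  if (Byte(P[Idx]) and $C0) = $80 then
    Result := -1    // still on a continuation byte: no lead byte found
  else
    Result := Idx;  // ASCII byte or lead byte
end;
```

A caller would still have to check that the lead byte announces a sequence long enough to cover the original position; this sketch leaves that out.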

Iterating over the bytes ourselves instead of relying on CompareMemRange alone makes the code approx. 10 times slower.
Finding the proper UTF8 codepoints, converting to WideString and calling WideCompareStr adds another factor of appr. 1.6.

Testing 1 million compares (strings of 123 codepoints, all 2-bytes, only last one differs):
Code: [Select]
Old Utf8CompareStr: 0141 ms
New Utf8CompareStr: 1996 ms
Ratio         : 14.2

14-18 times slower, but still better than 35 times slower when using WideCompareStr on both strings.

Bart
« Last Edit: March 14, 2016, 04:35:51 pm by Bart »

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 1315
Re: BUG? Can someone explain
« Reply #48 on: March 15, 2016, 11:46:11 am »
Bart, I agree.

With what?
You agree I should feel embarrassed?
You agree UtfCompare* should move towards WideCompare* w.r.t. the sign of the result?
I agreed with your explanation, I didn't see the two posts in-between.

Sorry for the misunderstanding.

Bart

  • Hero Member
  • *****
  • Posts: 5611
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #49 on: March 15, 2016, 01:55:56 pm »
...I didn't see the two posts in-between

Which made it quite hilarious  O:-)

Bart

@All: please test the code I supplied (not for speed but for bugs).
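
When testing for bugs rather than speed, a few properties can be checked mechanically. A minimal sketch, with the assumptions stated: TUtf8CompareFunc mirrors the signature of the function posted above, BytewiseCompare is only a stand-in so the sketch compiles on its own, and SignOf and CheckPair are hypothetical helpers.

```pascal
{$mode objfpc}{$H+}

type
  // Same shape as the PChar version of Utf8CompareStr posted above
  TUtf8CompareFunc = function(S1: PChar; Count1: SizeInt;
    S2: PChar; Count2: SizeInt): PtrInt;

{ Stand-in comparator (plain bytewise), only here so the sketch compiles }
function BytewiseCompare(S1: PChar; Count1: SizeInt;
  S2: PChar; Count2: SizeInt): PtrInt;
var
  i, Count: SizeInt;
begin
  if Count1 < Count2 then Count := Count1 else Count := Count2;
  for i := 0 to Count - 1 do
    if S1[i] <> S2[i] then
    begin
      if Byte(S1[i]) < Byte(S2[i]) then Result := -1 else Result := 1;
      Exit;
    end;
  // Equal up to the shorter length: the shorter string sorts first
  if Count1 < Count2 then Result := -1
  else if Count1 > Count2 then Result := 1
  else Result := 0;
end;

function SignOf(V: PtrInt): Integer;
begin
  if V < 0 then Result := -1
  else if V > 0 then Result := 1
  else Result := 0;
end;

{ True if swapping the arguments flips the sign of the result
  and if A compares equal to itself. }
function CheckPair(Cmp: TUtf8CompareFunc; const A, B: AnsiString): Boolean;
var
  AB, BA: PtrInt;
begin
  AB := Cmp(PChar(A), Length(A), PChar(B), Length(B));
  BA := Cmp(PChar(B), Length(B), PChar(A), Length(A));
  Result := (SignOf(AB) = -SignOf(BA))
    and (Cmp(PChar(A), Length(A), PChar(A), Length(A)) = 0);
end;
```

Passing the new Utf8CompareStr instead of BytewiseCompare, and feeding it pairs with multi-byte and deliberately broken UTF-8, should exercise the new code paths, including the -2/+2 fallback.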

Bart

Bart

  • Hero Member
  • *****
  • Posts: 5611
    • Bart en Mariska's Webstek
Re: BUG? Can someone explain
« Reply #50 on: March 17, 2016, 01:27:54 pm »
I committed the changes to Utf8CompareStr in r51977.

@All: please test with r51977 or later.

@Geepster:

If you are "stuck" with 1.6 then as a workaround try this:
  • Open LazUtf8 unit
  • Find procedure InitLazUtf8;
  • Go to its implementation (in file winlazutf8.inc)
  • Comment out the line: widestringmanager.CompareStrAnsiStringProc:=@UTF8CompareStr;

Now AnsiCompareStr will behave as before.

Bart

 
