Lazarus

Free Pascal => FPC development => Topic started by: jarto on October 14, 2014, 09:24:36 pm

Title: StringReplace extremely slow with big strings
Post by: jarto on October 14, 2014, 09:24:36 pm
I noticed this programming with Delphi 2007, but FPC does suffer from the same problem: StringReplace is awful with long strings. I ran into this in real world while wondering how Indy's headers could be so slow with big POST buffers. It converted '+' to ' ' with StringReplace.

When I looked at the code, the reason was obvious:  StringReplace does a lot of string manipulation, copy and concatenation. It gets really ugly fast, when there's a lot to replace and the string is big.

I wrote a new version, which is a lot more optimized. It can certainly be optimized more, but the general idea is to calculate the length of the result string, set the length once and only manipulate the contents of it.

Here are some test results:

58058 bytes Single char, replace all 203 x faster
58058 bytes Same length, replace all 61 x faster
58058 bytes Single char, replace all, ignore case 68 x faster
58058 bytes Same length, replace all, ignore case 34 x faster
58058 bytes New longer, replace all 32 x faster
58058 bytes New shorter, replace all 32 x faster
58058 bytes New longer, replace all, ignore case 22 x faster
58058 bytes New shorter, replace all, ignore case 22 x faster
58058 bytes New empty, replace all 35 x faster
58058 bytes New empty, replace all, ignore case 23 x faster
116058 bytes Single char, replace all 260 x faster
116058 bytes Same length, replace all 148 x faster
116058 bytes Single char, replace all, ignore case 153 x faster
116058 bytes Same length, replace all, ignore case 75 x faster
116058 bytes New longer, replace all 62 x faster
116058 bytes New shorter, replace all 61 x faster
116058 bytes New longer, replace all, ignore case 45 x faster
116058 bytes New shorter, replace all, ignore case 44 x faster
116058 bytes New empty, replace all 68 x faster
116058 bytes New empty, replace all, ignore case 50 x faster
174058 bytes Single char, replace all 518 x faster
174058 bytes Same length, replace all 228 x faster
174058 bytes Single char, replace all, ignore case 219 x faster
174058 bytes Same length, replace all, ignore case 113 x faster
174058 bytes New longer, replace all 100 x faster
174058 bytes New shorter, replace all 104 x faster
174058 bytes New longer, replace all, ignore case 70 x faster
174058 bytes New shorter, replace all, ignore case 72 x faster
174058 bytes New empty, replace all 107 x faster
174058 bytes New empty, replace all, ignore case 80 x faster
232058 bytes Single char, replace all 634 x faster
232058 bytes Same length, replace all 291 x faster
232058 bytes Single char, replace all, ignore case 305 x faster
232058 bytes Same length, replace all, ignore case 152 x faster
232058 bytes New longer, replace all 141 x faster
232058 bytes New shorter, replace all 140 x faster
232058 bytes New longer, replace all, ignore case 98 x faster
232058 bytes New shorter, replace all, ignore case 98 x faster
232058 bytes New empty, replace all 153 x faster
232058 bytes New empty, replace all, ignore case 107 x faster
290058 bytes Single char, replace all 1043 x faster
290058 bytes Same length, replace all 487 x faster
290058 bytes Single char, replace all, ignore case 542 x faster
290058 bytes Same length, replace all, ignore case 311 x faster
290058 bytes New longer, replace all 254 x faster
290058 bytes New shorter, replace all 249 x faster
290058 bytes New longer, replace all, ignore case 183 x faster
290058 bytes New shorter, replace all, ignore case 179 x faster
290058 bytes New empty, replace all 277 x faster
290058 bytes New empty, replace all, ignore case 197 x faster

And the new code is: (Update: This original code has a bug in it, look below for a fixed and improved version)

Code: [Select]
Function NewStringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  PatLength,NewPatLength,P,i,PatCount,PrevP: Integer;
  c,d: pchar;
begin
  PatLength:=Length(OldPattern);
  if PatLength=0 then begin
    Result:=S;
    exit;
  end;

  if rfIgnoreCase in Flags then begin
    Srch:=AnsiUpperCase(S);
    OldPat:=AnsiUpperCase(OldPattern);
  end else begin
    Srch:=S;
    OldPat:=OldPattern;
  end;

  PatLength:=Length(OldPat);
  if Length(NewPattern)=PatLength then begin
    //Result length will not change
    Result:=S;
    P:=1;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        for i:=1 to PatLength do
          Result[P+i-1]:=NewPattern[i];
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
      end;
    until p=0;
  end else begin
    //Different pattern length -> Result length will change
    //To avoid creating a lot of temporary strings, we count how many
    //replacements we're going to make.
    P:=1; PatCount:=0;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        inc(P,PatLength);
        inc(PatCount);
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    if PatCount=0 then begin
      Result:=S;
      exit;
    end;
    NewPatLength:=Length(NewPattern);
    SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
    P:=1; PrevP:=0;
    c:=pchar(Result); d:=pchar(S);
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        for i:=PrevP+1 to P-1 do begin
          c^:=d^;
          inc(c); inc(d);
        end;
        for i:=1 to NewPatLength do begin
          c^:=NewPattern[i];
          inc(c);
        end;
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
        inc(d,PatLength);
        PrevP:=P-1;
      end else begin
        for i:=PrevP+1 to Length(S) do begin
          c^:=d^;
          inc(c); inc(d);
        end;
      end;
    until p=0;
  end;
end;

Comments are welcome. It'd be nice to get this one or an even more improved one to the FPC source tree.
Title: Re: StringReplace extremely slow with big strings
Post by: Leledumbo on October 14, 2014, 11:15:25 pm
Quote
It'd be nice to get this one or an even more improved one to the FPC source tree.
Post it as a patch to the bugtracker. I'm not sure whether we have unit testing for rtl units, someone else might be able to confirm this and you can run your code against the unit tests. If it generates the same result as the original one, your code is likely to be accepted.
Title: Re: StringReplace extremely slow with big strings
Post by: circular on October 15, 2014, 12:14:34 am
Hi,

I would propose to change this:
Code: [Select]
        for i:=1 to PatLength do
          Result[P+i-1]:=NewPattern[i];
Into
Code: [Select]
        move(NewPattern[1], Result[P], PatLength);
And
Code: [Select]
        for i:=1 to NewPatLength do begin
          c^:=NewPattern[i];
          inc(c);
        end;
Into
Code: [Select]
        Move(NewPattern[1],c^,NewPatLength);
        inc(c, NewPatLength);

Also I think that
Code: [Select]
if not (rfReplaceAll in Flags) then exit;Will not copy the end of the string

Also the copying of the end of the string can also be done with Move
Title: Re: StringReplace extremely slow with big strings
Post by: jarto on October 15, 2014, 04:56:42 am
circular, thank you for your suggestions and for noticing how the "replace one" -case didn't copy until the end of the string. Here's the updated code.

Before submitting to the bugtracker, I wonder if it's ok that the function uses posex?

Code: [Select]
Function NewStringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
  c,d: pchar;
begin
  PatLength:=Length(OldPattern);
  if PatLength=0 then begin
    Result:=S;
    exit;
  end;

  if rfIgnoreCase in Flags then begin
    Srch:=AnsiUpperCase(S);
    OldPat:=AnsiUpperCase(OldPattern);
  end else begin
    Srch:=S;
    OldPat:=OldPattern;
  end;

  PatLength:=Length(OldPat);
  if Length(NewPattern)=PatLength then begin
    //Result length will not change
    Result:=S;
    P:=1;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        move(NewPattern[1],Result[P],PatLength);
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
      end;
    until p=0;
  end else begin
    //Different pattern length -> Result length will change
    //To avoid creating a lot of temporary strings, we count how many
    //replacements we're going to make.
    P:=1; PatCount:=0;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        inc(P,PatLength);
        inc(PatCount);
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    if PatCount=0 then begin
      Result:=S;
      exit;
    end;
    NewPatLength:=Length(NewPattern);
    SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
    P:=1; PrevP:=0;
    c:=pchar(Result); d:=pchar(S);
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        Cnt:=P-PrevP-1;
        if Cnt>0 then begin
          Move(d^,c^,Cnt);
          inc(c,Cnt);
          inc(d,Cnt);
        end;
        if NewPatLength>0 then begin
          Move(NewPattern[1],c^,NewPatLength);
          inc(c,NewPatLength);
        end;
        inc(P,PatLength);
        inc(d,PatLength);
        PrevP:=P-1;
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    Cnt:=Length(S)-PrevP;
    if Cnt>0 then Move(d^,c^,Cnt);
  end;
end;
Title: Re: StringReplace extremely slow with big strings
Post by: minesadorada on October 15, 2014, 08:30:18 am
@jarto: Is there also a UTF version (NewStringReplaceUTF8) and ANSI version (NewAnsiStringReplace) in the pipeline?  They could replace the ones in the fileutil unit.
Title: Re: StringReplace extremely slow with big strings
Post by: BigChimp on October 15, 2014, 08:44:46 am
Before submitting to the bugtracker, I wonder if it's ok that the function uses posex?
If the unit posex is in is already referenced by/present in the unit stringreplace is in (sorry to lazy to look it up now) I don't see a problem?
Title: Re: StringReplace extremely slow with big strings
Post by: jarto on October 15, 2014, 09:17:56 am
@jarto: Is there also a UTF version (NewStringReplaceUTF8) and ANSI version (NewAnsiStringReplace) in the pipeline?  They could replace the ones in the fileutil unit.

I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever. Unless you call it with rfIgnoreCase, of course.
Title: Re: StringReplace extremely slow with big strings
Post by: BigChimp on October 15, 2014, 09:35:16 am
I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.
I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.

However, I'd focus first on the current version you have and see what the fpc devs have to say when you submit your patch...
Title: Re: StringReplace extremely slow with big strings
Post by: BigChimp on October 15, 2014, 01:59:07 pm
BTW see http://bugs.freepascal.org/view.php?id=22501 where alternative stringreplace functions in conjunction with FPC trunk are discussed.
Title: Re: StringReplace extremely slow with big strings
Post by: circular on October 15, 2014, 05:16:55 pm
I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.
I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.
In my view, if it is a case sensitive search, it works with UTF8 too, because each UTF8 character start with a byte that indicate its length and the next bytes are always in a certain range. So when providing an UTF8 search string and an UTF8 pattern, it is not possible to identify a pattern without matching the beginning and ending of characters.

However, this makes a difference when the search is not case sensitive. The resulting string will certainly be corrupted if the search pattern or search string contains non ASCII characters when the AnsiUpperCase is applied on an UTF8 string.

Also it will not work for sure with widestrings.
Title: Re: StringReplace extremely slow with big strings
Post by: minesadorada on October 15, 2014, 05:33:24 pm
In StrUtils,
AnsiReplaceText just calls StringReplace with the flagset [rfReplaceAll,rfIgnoreCase]

=edit: oops! I meant StrUtils==
Title: Re: StringReplace extremely slow with big strings
Post by: jarto on October 15, 2014, 05:54:05 pm
Well, it's submitted. Let's see if it's accepted.

http://bugs.freepascal.org/view.php?id=26864
Title: Re: StringReplace extremely slow with big strings
Post by: User137 on October 15, 2014, 09:57:18 pm
Would it be faster if you would save the search results in TList (faster than dynamic array for this case) from first pass? Just in theory, it could be more optimal. You would need to store the PosEx result, an integer.
Title: Re: StringReplace extremely slow with big strings
Post by: jarto on October 16, 2014, 05:57:17 pm
I'm not sure that StringReplace should be more complicated to the point of using a TList to store values. It would probably cause it to be slower with short strings while speeding up faster ones.
Title: Re: StringReplace extremely slow with big strings
Post by: skalogryz on October 16, 2014, 06:53:57 pm
you need more efficient substring search than PosEx().
Check this out.
Code: [Select]
uses Classes, SysUtils, StrUtils, dateutils;

...function NewStringReplace...

var
  t: TDateTime;
  src : string;
  i   : integer;
  tmp : string;
  l   : string;
begin
  tmp:='aaaaaaaab';
  l:=Copy(tmp,1,length(tmp)-1);
  src:='';
  for i:=0 to 500000 do src:=src+l;
  writeln('prepared!');
  src:=src+tmp;

   t:=now;
   StringReplace(src,tmp,'*',[rfReplaceAll]);
   writeln('rtl: ',MilliSecondsBetween(now,t));


   t:=now;
   NewStringReplace(src,tmp,'*',[rfReplaceAll]);
   writeln('new: ', MilliSecondsBetween(now,t));
end.
Title: Re: StringReplace extremely slow with big strings
Post by: jarto on October 16, 2014, 07:49:42 pm
you need more efficient substring search than PosEx().

This is too funny. PosEx() uses indexbyte to find the first byte. It's even assembler code.
Pos() uses a simple while loop and is faster :-D
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 22, 2018, 12:36:08 pm
Hi, any idea when will these fixes will be applied in an official Lazarus release?
Title: Re: StringReplace extremely slow with big strings
Post by: ASerge on January 22, 2018, 04:03:36 pm
If the unit posex is in is already referenced by/present in the unit stringreplace is in (sorry to lazy to look it up now) I don't see a problem?
New StringReplace use PosEx, which in StrUtils, which uses SysUtils, which contain StringReplace.
Of course, the cyclic dependency can be broken through the implementation section, but I do not like this decision.
Title: Re: StringReplace extremely slow with big strings
Post by: marcov on January 22, 2018, 04:06:44 pm
Afaik in some recent version POS() has gotten an extra parameter and is now posex like. Might change the cycle.

IIRC it is even merged to 3.0.4, but am not 100% sure.
Title: Re: StringReplace extremely slow with big strings
Post by: RAW on January 22, 2018, 07:23:46 pm
Maybe this works for you....
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 23, 2018, 08:08:36 am
Thanks, I looked at the code... since it uses Posex, I have a bad feeling it won't works with UTF8... ?

Until now I used this code from Lazarus bugzilla:

Code: Pascal  [Select][+][-]
  1. //ANSI
  2. Function StringReplaceNew(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  3. var
  4.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  5.   PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
  6.   c,d: pchar;
  7. begin
  8.   PatLength:=Length(OldPattern);
  9.   if PatLength=0 then begin
  10.     Result:=S;
  11.     exit;
  12.   end;
  13.  
  14.  
  15.   if rfIgnoreCase in Flags then begin
  16.     Srch:=AnsiUpperCase(S);
  17.     OldPat:=AnsiUpperCase(OldPattern);
  18.   end else begin
  19.     Srch:=S;
  20.     OldPat:=OldPattern;
  21.   end;
  22.  
  23.  
  24.   PatLength:=Length(OldPat);
  25.   if Length(NewPattern)=PatLength then begin
  26.     //Result length will not change
  27.     Result:=S;
  28.     P:=1;
  29.     repeat
  30.       P:=PosEx(OldPat,Srch,P);
  31.       if P>0 then begin
  32.         move(NewPattern[1],Result[P],PatLength);
  33.         if not (rfReplaceAll in Flags) then exit;
  34.         inc(P,PatLength);
  35.       end;
  36.     until p=0;
  37.   end else begin
  38.     //Different pattern length -> Result length will change
  39.     //To avoid creating a lot of temporary strings, we count how many
  40.     //replacements we're going to make.
  41.     P:=1; PatCount:=0;
  42.     repeat
  43.       P:=PosEx(OldPat,Srch,P);
  44.       if P>0 then begin
  45.         inc(P,PatLength);
  46.         inc(PatCount);
  47.         if not (rfReplaceAll in Flags) then break;
  48.       end;
  49.     until p=0;
  50.     if PatCount=0 then begin
  51.       Result:=S;
  52.       exit;
  53.     end;
  54.     NewPatLength:=Length(NewPattern);
  55.     SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
  56.     P:=1; PrevP:=0;
  57.     c:=pchar(Result); d:=pchar(S);
  58.     repeat
  59.       P:=PosEx(OldPat,Srch,P);
  60.       if P>0 then begin
  61.         Cnt:=P-PrevP-1;
  62.         if Cnt>0 then begin
  63.           Move(d^,c^,Cnt);
  64.           inc(c,Cnt);
  65.           inc(d,Cnt);
  66.         end;
  67.         if NewPatLength>0 then begin
  68.           Move(NewPattern[1],c^,NewPatLength);
  69.           inc(c,NewPatLength);
  70.         end;
  71.         inc(P,PatLength);
  72.         inc(d,PatLength);
  73.         PrevP:=P-1;
  74.         if not (rfReplaceAll in Flags) then break;
  75.       end;
  76.     until p=0;
  77.     Cnt:=Length(S)-PrevP;
  78.     if Cnt>0 then Move(d^,c^,Cnt);
  79.   end;
  80. end;    

but it makes mess with UTF8 strings (even though I am replacing ansi strings with ansi strings).
So I tried to modify it to:


Code: Pascal  [Select][+][-]
  1. //UTF8
  2. Function StringReplaceNew(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  3. var
  4.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  5.   PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
  6.   c,d: pchar;
  7. begin
  8.   PatLength:=utf8Length(OldPattern);
  9.   if PatLength=0 then begin
  10.     Result:=S;
  11.     exit;
  12.   end;
  13.  
  14.  
  15.   if rfIgnoreCase in Flags then begin
  16.     Srch:=UTF8UpperCase(S);
  17.     OldPat:=UTF8UpperCase(OldPattern);
  18.   end else begin
  19.     Srch:=S;
  20.     OldPat:=OldPattern;
  21.   end;
  22.  
  23.  
  24.   PatLength:=utf8Length(OldPat);
  25.   if utf8Length(NewPattern)=PatLength then begin
  26.     //Result length will not change
  27.     Result:=S;
  28.     P:=1;
  29.     repeat
  30.       P:=utf8Pos(OldPat,Srch,P);
  31.       if P>0 then begin
  32.         move(NewPattern[1],Result[P],PatLength);
  33.         if not (rfReplaceAll in Flags) then exit;
  34.         inc(P,PatLength);
  35.       end;
  36.     until p=0;
  37.   end else begin
  38.     //Different pattern length -> Result length will change
  39.     //To avoid creating a lot of temporary strings, we count how many
  40.     //replacements we're going to make.
  41.     P:=1; PatCount:=0;
  42.     repeat
  43. //      P:=utf8Pos(OldPat,Srch,P);   //This counts only the occurences of the sougth string, so there is possibly no reason to use UFT8Pos, whish is extremely slow
  44.       P:=Posex(OldPat,Srch,P);
  45.       if P>0 then begin
  46.         inc(P,PatLength);
  47.         inc(PatCount);
  48.         if not (rfReplaceAll in Flags) then break;
  49.       end;
  50.     until p=0;
  51.     if PatCount=0 then begin
  52.       Result:=S;
  53.       exit;
  54.     end;
  55.     NewPatLength:=utf8Length(NewPattern);
  56.     SetLength(Result,utf8Length(S)+PatCount*(NewPatLength-PatLength));
  57.     P:=1; PrevP:=0;
  58.     c:=pchar(Result); d:=pchar(S);
  59.     repeat
  60.       P:=utf8Pos(OldPat,Srch,P);
  61.       if P>0 then begin
  62.         Cnt:=P-PrevP-1;
  63.         if Cnt>0 then begin
  64.           Move(d^,c^,Cnt);
  65.           inc(c,Cnt);
  66.           inc(d,Cnt);
  67.         end;
  68.         if NewPatLength>0 then begin
  69.           Move(NewPattern[1],c^,NewPatLength);
  70.           inc(c,NewPatLength);
  71.         end;
  72.         inc(P,PatLength);
  73.         inc(d,PatLength);
  74.         PrevP:=P-1;
  75.         if not (rfReplaceAll in Flags) then break;
  76.       end;
  77.     until p=0;
  78.     Cnt:=utf8Length(S)-PrevP;
  79.     if Cnt>0 then Move(d^,c^,Cnt);
  80.   end;
  81. end;
I have no idea if it works properly, but is extremely slow.  :'(  It seems to me that utf8Pos is the bottle neck. A very narrow one.
Title: Re: StringReplace extremely slow with big strings
Post by: Thaddy on January 23, 2018, 09:10:24 am
If you replace ansi with ansi, plz do NOT use string in Lazarus: that is UTF8. Rename string to AnsiString.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 23, 2018, 10:12:27 am
@Thaddy, I am not sure that I understood you right and that you understood me. I am replacing Ansi with Ansi in an UTF8 string.

In other words: Strings which I am replacing are with ascii code <127 (basic Latin, &%$# etc). But the sting which contains them may also contain uтf8 characters.
Title: Re: StringReplace extremely slow with big strings
Post by: JuhaManninen on January 23, 2018, 12:01:32 pm
If you replace ansi with ansi, plz do NOT use string in Lazarus: that is UTF8. Rename string to AnsiString.
Thaddy, in Lazarus by default String = AnsiString.
You should however use type String always because the code is then compatible with Delphi and with future mode DelphiUnicode solutions.
Remember, Ansi...() functions like AnsiUpperCase() work transparently with different encodings then.

If someone still uses the old Windows codepages, then please switch to Unicode.

CM630, please see the example:
 http://wiki.lazarus.freepascal.org/UTF8_strings_and_characters#Search_and_copy
Pos() and PosEx() work in StringReplaceNew for the same reasons they work in the given example.
CodeUnit (Pascal Char) resolution is still very useful with variable width encodings, and is fast.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 23, 2018, 12:58:59 pm

1. What I got from the example is that I should keep PosEx and Length in StringReplaceNew.


2. CodeUnit is some type of variable. But I could no find more info about it.
3. My current (old) implementation of StringReplaceNew messes the strings up.
4. I will try RAW's solution and feedback.
    Edit: I tried it. My app crashes with a range check error. :(
Title: Re: StringReplace extremely slow with big strings
Post by: JuhaManninen on January 23, 2018, 01:11:37 pm
2. CodeUnit is some type of variable. But I could no find more info about it.
No. It is a concept of Unicode. It is the smallest atomic building block. A code point can consist of one or more code units.
In Pascal terms it is type Char, a byte with UTF-8 and a word with UTF-16.
See:
 https://en.wikipedia.org/wiki/Character_encoding#Terminology
and search for "code unit".
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 24, 2018, 04:41:56 pm
3. My current (old) implementation of StringReplaceNew messes the strings up.

You did not show the current implementation. I assume something like:
Code: Pascal  [Select][+][-]
  1. Function StringReplaceNew(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. var
  3.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  4.   PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
  5.   c,d: pchar;
  6. begin
  7.   PatLength:=Length(OldPattern);
  8.   if PatLength=0 then
  9.     exit(S);
  10.  
  11.   if rfIgnoreCase in Flags then begin
  12.     Srch:=UTF8UpperCase(S);
  13.     OldPat:=UTF8UpperCase(OldPattern);
  14.   end else begin
  15.     Srch:=S;
  16.     OldPat:=OldPattern;
  17.   end;
  18.  
  19.   PatLength:=Length(OldPat);
  20.   if Length(NewPattern)=PatLength then begin
  21.     //Result length will not change
  22.     Result:=S;
  23.     P:=1;
  24.     repeat
  25.       P:=PosEx(OldPat,Srch,P);
  26.       if P>0 then begin
  27.         move(NewPattern[1],Result[P],PatLength);
  28.         if not (rfReplaceAll in Flags) then exit;
  29.         inc(P,PatLength);
  30.       end;
  31.     until p=0;
  32.   end else begin
  33.     //Different pattern length -> Result length will change
  34.     //To avoid creating a lot of temporary strings, we count how many
  35.     //replacements we're going to make.
  36.     P:=1; PatCount:=0;
  37.     repeat
  38.       P:=PosEx(OldPat,Srch,P);
  39.       if P>0 then begin
  40.         inc(P,PatLength);
  41.         inc(PatCount);
  42.         if not (rfReplaceAll in Flags) then break;
  43.       end;
  44.     until p=0;
  45.     if PatCount=0 then begin
  46.       Result:=S;
  47.       exit;
  48.     end;
  49.     NewPatLength := Length(NewPattern);
  50.     SetLength(Result, Length(S) + PatCount*(NewPatLength-PatLength));
  51.     P:=1; PrevP:=0;
  52.     c:=pchar(Result);
  53.     d:=pchar(S);
  54.     repeat
  55.       P:=PosEx(OldPat,Srch,P);
  56.       if P>0 then begin
  57.         Cnt:=P-PrevP-1;
  58.         if Cnt>0 then begin
  59.           Move(d^,c^,Cnt);
  60.           inc(c,Cnt);
  61.           inc(d,Cnt);
  62.         end;
  63.  
  64.         Move(NewPattern[1],c^,NewPatLength);
  65.         inc(c,NewPatLength);
  66.         inc(P,PatLength);
  67.         inc(d,PatLength);
  68.         PrevP:=P-1;
  69.         if not (rfReplaceAll in Flags) then break;
  70.       end;
  71.     until p=0;
  72.     Cnt:=Length(S)-PrevP;
  73.     Move(d^,c^,Cnt);
  74.   end;
  75. end;

I believe you have a problem with rfIgnoreCase. Your code assumes that Srch and S have the same length. Upper case and lower case do *not* necessarily have the same length. Check the source code for UTF8UpperCase:
Code: Pascal  [Select][+][-]
  1.       if IsTurkish and (AInStr[InCounter] = 'i') then
  2.       begin
  3.         SetLength(Result,Length(Result)+1);// Increase the buffer
  4. ...
  5.         inc(InCounter);
  6.         inc(OutCounter,2);
and other calls to CorrectOutStrSize. You'll have to build the index for both strings and shift the location based on it.

Edit:
UTF8StringReplace from unit LazUTF8 has the same bug. It uses UTF8LowerCase instead.

Code: Pascal  [Select][+][-]
  1.   { To test the bug with UTF8StringReplace }
  2.   { LATIN CAPITAL LETTER A WITH STROKE' (U+23A) Ⱥ}
  3.   SetLength(sC8BA,2);
  4.   sC8BA[1] := #$C8;
  5.   sC8BA[2] := #$BA; //or #$BE
  6.  
  7.   { To test the bug with NewStringReplace }
  8.   { LATIN SMALL LETTER TURNED ALPHA' (U+0252) ɒ}
  9.   SetLength(sC992,2);
  10.   sC992[1] := #$C9;
  11.   sC992[2] := #$92;
Title: Re: StringReplace extremely slow with big strings
Post by: JuhaManninen on January 25, 2018, 10:45:22 pm
I believe you have a problem with rfIgnoreCase. Your code assumes that Srch and S have the same length. Upper case and lower case do *not* necessarily have the same length.
True. It also means functions UTF8SwapCase and my recently added UTF8ProperCase in unit LazUTF8 are buggy.
On the positive side the lengths are the same in most cases but that is no excuse for a bug of course.

Quote
UTF8StringReplace from unit LazUTF8 has the same bug. It uses UTF8LowerCase instead.
Ok, I didn't realize that one. The function has been there forever.

If somebody wants to fix those bugs, please do. I can commit the fixes.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 26, 2018, 08:44:12 am
You did not show the current implementation. I assume something like:
I did in post 20 (first function). Or you mean sth. else?


I believe you have a problem with rfIgnoreCase. Your code assumes that Srch and S have the same length. Upper case and lower case do *not* necessarily have the same length. Check the source code for UTF8UpperCase:
I think your belief is right! I removed all rfIgnoreCase and now I do not notice any corruption of the string (but I cannot say for sure that there is not any).

On the positive side the lengths are the same in most cases but that is no excuse for a bug of course.

On the negative side StringReplace will never work for Turkish and some other languages.
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 27, 2018, 05:20:08 am
Here is a possible replacement for UTF8StringReplace. It needs to be tested.
Code: Pascal  [Select][+][-]
  1. function UTF8StringReplace_1(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. var
  3.   Matches: array of integer;
  4.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  5.   PatLength,P,Count, Capacity, i: Integer;
  6.   p1,p2, MemPos: pchar;
  7.   i1,i2, l: integer;
  8. begin
  9.   if (Length(OldPattern)=0) or (Length(S)=0) then
  10.     exit(S);
  11.  
  12.   if rfIgnoreCase in Flags then begin
  13.     Srch := UTF8LowerCase(S);
  14.     OldPat := UTF8LowerCase(OldPattern);
  15.   end else begin
  16.     Srch := S;
  17.     OldPat := OldPattern;
  18.   end;
  19.  
  20.   { Find matches }
  21.   Count := 0;
  22.   Capacity := 10;
  23.   SetLength(Matches, Capacity);
  24.   P:=PosEx(OldPat, Srch, 1);
  25.   if P=0 then exit(s);
  26.   if rfReplaceAll in Flags then  { TODO : Consider using FindMatchesBoyerMooreCaseSensitive }
  27.     while p<>0 do begin
  28.       Matches[Count] := p;
  29.       inc(Count);
  30.       if Count=Capacity then begin { Grow }
  31.         inc(Capacity, 10);
  32.         SetLength(Matches, Capacity);
  33.       end;
  34.       P:=PosEx(OldPat,Srch,P+1);
  35.     end
  36.   else begin
  37.       Matches[Count] := p;
  38.       inc(Count);
  39.   end;
  40.  
  41.   if rfIgnoreCase in Flags then begin { Correct match positions }
  42.     p1 := @Srch[1]; p2 := @S[1];
  43.     for i := 0 to Count-1 do begin
  44.       MemPos := @Srch[Matches[i]];
  45.       while p1<MemPos do begin
  46.         inc(p1, UTF8CharacterLengthFast(p1));
  47.         inc(p2, UTF8CharacterLengthFast(p2));
  48.       end;
  49.       Matches[i] := p2-@S[1]+1;
  50.     end;
  51.   end;
  52.  
  53.   PatLength := Length(OldPat);
  54.   if Length(NewPattern)=PatLength then begin
  55.     //Result length will not change
  56.     Result:=S;
  57.     if (rfReplaceAll in Flags) then
  58.       for i := 0 to Count-1 do move(NewPattern[1], Result[Matches[i]], PatLength)
  59.     else
  60.       move(NewPattern[1], Result[Matches[0]], PatLength);
  61.   end else begin
  62.     SetLength(Result, Length(S) + Count*(Length(NewPattern)-Length(OldPattern)));
  63.     i1 := 1; i2 := 1;
  64.     for i := 0 to Count-1 do begin
  65.       l := Matches[i]-i1;
  66.       move(S[i1], Result[i2], l); { Copy text before pattern }
  67.       inc(i2, l); { Move to the location of the pattern }
  68.       i1 := Matches[i]+Length(OldPattern); { Move over the old pattern }
  69.  
  70.       move(NewPattern[1], Result[i2], Length(NewPattern));  { Copy the new pattern }
  71.       inc(i2, Length(NewPattern));
  72.     end;
  73.     if i1<=Length(S) then
  74.       move(S[i1], Result[i2], Length(S)-i1+1); { Copy leftover text }
  75.   end;
  76. end;
Title: Re: StringReplace extremely slow with big strings
Post by: Thaddy on January 27, 2018, 09:27:19 am
engkin given
Code: Pascal  [Select][+][-]
  1. { TODO : Consider using FindMatchesBoyerMooreCaseSensitive }
Note that in trunk the BoyerMoore routines are now functions that return true if there are matches or false otherwise.
May simplify using it. They have also been made faster recently.
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 27, 2018, 01:20:01 pm
Thank you Thaddy. I should use the trunk version then.
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 28, 2018, 07:47:47 pm
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Lazarus IDE (which uses UTF8UpperCase) does follow this rule and changes ß to SS.

Similar ligatures that should expand to a few characters:
ff fi fl ffi ffl ſt st (U+FB00 to U+FB06)

UTF8UpperCase (Not sure if by design or a bug?) does not expand them. But it does expand ß.

I am not aware of any codepoint that gives more than one codepoint for its lower case.
Title: Re: StringReplace extremely slow with big strings
Post by: PascalDragon on January 28, 2018, 08:46:51 pm
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS ;) (more info here (https://en.wikipedia.org/wiki/Capital_%E1%BA%9E))
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 06:37:24 am
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS ;) (more info here (https://en.wikipedia.org/wiki/Capital_%E1%BA%9E))
Thank you! Unicode Case Charts (http://www.unicode.org/charts/case/index.html) still show that the upper case for ß is SS.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 29, 2018, 06:59:47 am
Here is a possible replacement for UTF8StringReplace. It needs to be tested...
Tested:
Code: [Select]
TApplication.HandleException Range check error
  Stack trace:
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas
  $004306EF  TFRMMAIN__OPENXML,  line 1976 of frmMainSRC.pas
  $0043C837  TFRMMAIN__FORMDROPFILES,  line 3043 of frmMainSRC.pas
  $0041F0E8  TCUSTOMFORM__INTFDROPFILES,  line 2619 of ./include/customfor
  $00522AD7  TWINDOWPROCHELPER__HANDLEDROPFILES,  line 1107 of ./win32/win
  $00525136  TWINDOWPROCHELPER__DOWINDOWPROC,  line 2246 of ./win32/win32c
  $00525F3B  WINDOWPROC,  line 2657 of ./win32/win32callback.inc
  $005FE933  CUSTOMFORMWNDPROC,  line 386 of ./win32/win32wsforms.pp
  $76D062FA
  $76D06D3A
  $76D077C4
  $76D0788A
  $00526DAD  TWIN32WIDGETSET__APPPROCESSMESSAGES,  line 407 of ./win32/win
  $0042476D  TAPPLICATION__HANDLEMESSAGE,  line 1276 of ./include/applicat
  $00424B8E  TAPPLICATION__RUNLOOP,  line 1413 of ./include/application.in
  $00470E40  TWIDGETSET__APPRUN,  line 54 of ./include/interfacebase.inc
  $00424B4E  TAPPLICATION__RUN,  line 1401 of ./include/application.inc
This happens after several replacements, I will try to remove them one by one.
Unfortunately I cannot share my text files (since they are not mine).


Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS  (more info here)
Capital Eszett sound pretty much to me like a capital „ь‟ (small er)  :D
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 07:03:51 am
Quote
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas

I assume you changed its name to STRINGREPLACENEW. Which line is 966?
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 29, 2018, 07:05:10 am
Quote
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas

I assume you changed its name to STRINGREPLACENEW.

Precisely.


Which line is 966?
Exactly
Code: Pascal  [Select][+][-]
  1.       move(NewPattern[1], Result[i2], Length(NewPattern));  { Copy the new pattern }
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 07:09:23 am
It seems that Result does not have the correct size. \
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 07:14:12 am
Or your text is not valid UTF8. Can you change UTF8CharacterLengthFast to UTF8CharacterLength and try again?
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 29, 2018, 07:22:50 am
Or your text is not valid UTF8. Can you change UTF8CharacterLengthFast to UTF8CharacterLength and try again?
Same behaviour. But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.

I wrapped Line 966 in try ... except writeln oldpattern... new pattern, and I got that this happens when I am executing this line:
    FileContents:=StringReplaceNew(FileContents, ' cellpadding="0" cellspacing="0"',    '', ReplaceParams);
I commented it out and now the app does not crash.
I have some 20 calls of the function StringReplaceNew before it and they seem to be done okay (at least do not cause a crash).

EDIT: I tried

Code: Pascal  [Select][+][-]
  1. procedure TForm1.Button1Click(Sender: TObject);
  2. var
  3.   FileContents: string;
  4. begin
  5.   FileContents:= 'алабала cellpadding="0" cellspacing="0" тралала' + LineEnding + 'алабала cellpadding="0" cellspacing="0" тралала' ;
  6.   FileContents:=StringReplaceNew(FileContents, ' cellpadding="0" cellspacing="0"',    '',[rfReplaceAll, rfIgnoreCase]);
  7.   ShowMessage (FileContents);
  8. end;
and it works fine.

Edit: In other files, which had no problems with the previous StringReplaceNew routine I started having troubles. Not only with cellpadding...
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 07:56:48 am
But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.
Use FindInvalidUTF8Character from LazUTF8. Something like:
Code: Pascal  [Select][+][-]
  1.   LocationOfBadCode := FindInvalidUTF8Character(@FileContents[1], Length(FileContents), True);
  2.   If LocationOfBadCode<>-1 then
  3.      WriteLn('Invalid UTF8 string');

Edit:
If the text is valid, does the problem happen with and without rfIgnoreCase?
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 29, 2018, 08:20:52 am
I did:


Code: Pascal  [Select][+][-]
  1. function StringReplaceNew(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. var
  3.   Matches: array of integer;
  4.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  5.   PatLength,P,Count, Capacity, i: Integer;
  6.   p1,p2, MemPos: pchar;
  7.   i1,i2, l: integer;
  8.   LocationOfBadCode: integer;
  9. begin
  10.   LocationOfBadCode := FindInvalidUTF8Character(@S[1], Length(S), True);
  11.   If LocationOfBadCode<>-1 then
  12.   begin
  13.     writeln('Invalid UTF8 string' + LineEnding
  14.        + OldPattern + LineEnding +  NewPattern);
  15.   end;  
  16. ...
and I have put a breakpoint on  writeln...
Everything seems to be valid. Removing   rfIgnoreCase did not change behaviour.

EDIT: I have insulated a troublesome string and executed it in a standalone app. Then replacements are done just fine!??!?
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on January 29, 2018, 07:45:40 pm
EDIT: I have insulated a troublesome string and executed it in a standalone app. Then replacements are done just fine!??!?[/font]

YES! You have range check activated in the main project, but not in the standalone app. That explains the exceptions with and without rfIgnoreCase.

Either way, I fell in the same bug that I was fixing. I assumed that all patterns have equal lengths, but they are not necessarily equal.

This is another version that should correct these problems:
Code: Pascal  [Select][+][-]
  1. function UTF8StringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. type
  3.   TMatch=Record
  4.     Pos,Length: integer;
  5.   end;
  6.  
  7. var
  8.   Matches: array of TMatch;
  9.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  10.   PatLength,P,Count, Capacity, i: Integer;
  11.   p1,p2, MemPos, p3: pchar;
  12.   i1,i2, l, j: integer;
  13.   OldPatLengths: integer;
  14. begin
  15.   if (Length(OldPattern)=0) or (Length(S)=0) then
  16.     exit(S);
  17.  
  18.   if rfIgnoreCase in Flags then begin
  19.     Srch := UTF8LowerCase(S);
  20.     OldPat := UTF8LowerCase(OldPattern);
  21.   end else begin
  22.     Srch := S;
  23.     OldPat := OldPattern;
  24.   end;
  25.  
  26.   { Find matches }
  27.   Count := 0;
  28.   Capacity := 10;
  29.   SetLength(Matches, Capacity);
  30.   P:=PosEx(OldPat, Srch, 1);
  31.   if P=0 then exit(s);
  32.   if rfReplaceAll in Flags then  { TODO : Consider using FindMatchesBoyerMooreCaseSensitive }
  33.     while p<>0 do begin
  34.       Matches[Count].pos := p;
  35.       Matches[Count].Length := Length(OldPat);
  36.       inc(Count);
  37.       if Count=Capacity then begin { Grow }
  38.         inc(Capacity, 10);
  39.         SetLength(Matches, Capacity);
  40.       end;
  41.       P:=PosEx(OldPat,Srch,P+1);
  42.     end
  43.   else begin
  44.       Matches[Count].pos := p;
  45.       Matches[Count].Length := Length(OldPat);
  46.       inc(Count);
  47.   end;
  48.  
  49.   if rfIgnoreCase in Flags then begin { Correct match positions and lengths }
  50.     OldPatLengths := 0;
  51.     p1 := @Srch[1]; p2 := @S[1];
  52.     for i := 0 to Count-1 do begin
  53.       MemPos := @Srch[Matches[i].pos];
  54.       while p1<MemPos do begin
  55.         inc(p1, UTF8CharacterLengthFast(p1));
  56.         inc(p2, UTF8CharacterLengthFast(p2));
  57.       end;
  58.       Matches[i].pos := p2-@S[1]+1;
  59.       Matches[i].Length:=0;
  60.  
  61.       { Get length of old pattern in S }
  62.       { Patterns do not necessarily have the same length with rfIgnoreCase }
  63.       { We assume they have the same number of codepoints }
  64.       p3 := p2;
  65.       for j := 1 to Length(OldPat) do
  66.       begin
  67.         inc(p1, UTF8CharacterLengthFast(p1));
  68.         inc(p3, UTF8CharacterLengthFast(p3));
  69.       end;
  70.       Matches[i].Length := p3-p2;
  71.       Inc(OldPatLengths, Matches[i].Length);
  72.       p2 := p3;
  73.     end;
  74.   end
  75.   else
  76.     OldPatLengths := Length(OldPat)*Count;
  77.  
  78.   PatLength := Length(OldPat);
  79.   if not (rfIgnoreCase in Flags) and (Length(NewPattern)=PatLength) then begin
  80.     //Result length will not change
  81.     Result:=S;
  82.     if (rfReplaceAll in Flags) then
  83.       for i := 0 to Count-1 do move(NewPattern[1], Result[Matches[i].pos], PatLength)
  84.     else
  85.       move(NewPattern[1], Result[Matches[0].pos], PatLength);
  86.   end else begin
  87.     SetLength(Result, Length(S) + Count*Length(NewPattern)-OldPatLengths);
  88.     if Length(Result)=0 then exit;
  89.     i1 := 1; i2 := 1;
  90.     for i := 0 to Count-1 do begin
  91.       l := Matches[i].pos-i1;
  92.       if l>0 then
  93.         move(S[i1], Result[i2], l); { Copy text before pattern }
  94.       inc(i2, l); { Move to the location of the pattern }
  95.       i1 := Matches[i].pos+Matches[i].Length; { Move over the old pattern }
  96.  
  97.       if Length(NewPattern)>0 then
  98.         move(NewPattern[1], Result[i2], Length(NewPattern));  { Copy the new pattern }
  99.       inc(i2, Length(NewPattern));
  100.     end;
  101.     if (i1>0) and (i1<=Length(S)) then
  102.       move(S[i1], Result[i2], Length(S)-i1+1); { Copy leftover text }
  103.   end;
  104. end;

It does need testing.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on January 30, 2018, 08:36:04 am
Lots of thanks, so far it seems to work fine!!!
I will report immediately if I see it misbehaving. Or within 2 weeks if I see that everything is fine.
Title: Re: StringReplace extremely slow with big strings
Post by: CM630 on March 26, 2018, 10:35:10 am
...Or within 2 weeks if I see that everything is fine.
I as have promised to provide feedback- so far I have no issues with this code.
Title: Re: StringReplace extremely slow with big strings
Post by: engkin on March 26, 2018, 10:59:55 pm
Thanks for the feedback.
TinyPortal © 2005-2018