Recent

Author Topic: StringReplace extremely slow with big strings  (Read 32116 times)

jarto

  • Full Member
  • ***
  • Posts: 106
StringReplace extremely slow with big strings
« on: October 14, 2014, 09:24:36 pm »
I noticed this programming with Delphi 2007, but FPC does suffer from the same problem: StringReplace is awful with long strings. I ran into this in real world while wondering how Indy's headers could be so slow with big POST buffers. It converted '+' to ' ' with StringReplace.

When I looked at the code, the reason was obvious:  StringReplace does a lot of string manipulation, copy and concatenation. It gets really ugly fast, when there's a lot to replace and the string is big.

I wrote a new version, which is a lot more optimized. It can certainly be optimized more, but the general idea is to calculate the length of the result string, set the length once and only manipulate the contents of it.

Here are some test results:

58058 bytes Single char, replace all 203 x faster
58058 bytes Same length, replace all 61 x faster
58058 bytes Single char, replace all, ignore case 68 x faster
58058 bytes Same length, replace all, ignore case 34 x faster
58058 bytes New longer, replace all 32 x faster
58058 bytes New shorter, replace all 32 x faster
58058 bytes New longer, replace all, ignore case 22 x faster
58058 bytes New shorter, replace all, ignore case 22 x faster
58058 bytes New empty, replace all 35 x faster
58058 bytes New empty, replace all, ignore case 23 x faster
116058 bytes Single char, replace all 260 x faster
116058 bytes Same length, replace all 148 x faster
116058 bytes Single char, replace all, ignore case 153 x faster
116058 bytes Same length, replace all, ignore case 75 x faster
116058 bytes New longer, replace all 62 x faster
116058 bytes New shorter, replace all 61 x faster
116058 bytes New longer, replace all, ignore case 45 x faster
116058 bytes New shorter, replace all, ignore case 44 x faster
116058 bytes New empty, replace all 68 x faster
116058 bytes New empty, replace all, ignore case 50 x faster
174058 bytes Single char, replace all 518 x faster
174058 bytes Same length, replace all 228 x faster
174058 bytes Single char, replace all, ignore case 219 x faster
174058 bytes Same length, replace all, ignore case 113 x faster
174058 bytes New longer, replace all 100 x faster
174058 bytes New shorter, replace all 104 x faster
174058 bytes New longer, replace all, ignore case 70 x faster
174058 bytes New shorter, replace all, ignore case 72 x faster
174058 bytes New empty, replace all 107 x faster
174058 bytes New empty, replace all, ignore case 80 x faster
232058 bytes Single char, replace all 634 x faster
232058 bytes Same length, replace all 291 x faster
232058 bytes Single char, replace all, ignore case 305 x faster
232058 bytes Same length, replace all, ignore case 152 x faster
232058 bytes New longer, replace all 141 x faster
232058 bytes New shorter, replace all 140 x faster
232058 bytes New longer, replace all, ignore case 98 x faster
232058 bytes New shorter, replace all, ignore case 98 x faster
232058 bytes New empty, replace all 153 x faster
232058 bytes New empty, replace all, ignore case 107 x faster
290058 bytes Single char, replace all 1043 x faster
290058 bytes Same length, replace all 487 x faster
290058 bytes Single char, replace all, ignore case 542 x faster
290058 bytes Same length, replace all, ignore case 311 x faster
290058 bytes New longer, replace all 254 x faster
290058 bytes New shorter, replace all 249 x faster
290058 bytes New longer, replace all, ignore case 183 x faster
290058 bytes New shorter, replace all, ignore case 179 x faster
290058 bytes New empty, replace all 277 x faster
290058 bytes New empty, replace all, ignore case 197 x faster

And the new code is: (Update: This original code has a bug in it, look below for a fixed and improved version)

Code: [Select]
Function NewStringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  PatLength,NewPatLength,P,i,PatCount,PrevP: Integer;
  c,d: pchar;
begin
  PatLength:=Length(OldPattern);
  if PatLength=0 then begin
    Result:=S;
    exit;
  end;

  if rfIgnoreCase in Flags then begin
    Srch:=AnsiUpperCase(S);
    OldPat:=AnsiUpperCase(OldPattern);
  end else begin
    Srch:=S;
    OldPat:=OldPattern;
  end;

  PatLength:=Length(OldPat);
  if Length(NewPattern)=PatLength then begin
    //Result length will not change
    Result:=S;
    P:=1;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        for i:=1 to PatLength do
          Result[P+i-1]:=NewPattern[i];
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
      end;
    until p=0;
  end else begin
    //Different pattern length -> Result length will change
    //To avoid creating a lot of temporary strings, we count how many
    //replacements we're going to make.
    P:=1; PatCount:=0;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        inc(P,PatLength);
        inc(PatCount);
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    if PatCount=0 then begin
      Result:=S;
      exit;
    end;
    NewPatLength:=Length(NewPattern);
    SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
    P:=1; PrevP:=0;
    c:=pchar(Result); d:=pchar(S);
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        for i:=PrevP+1 to P-1 do begin
          c^:=d^;
          inc(c); inc(d);
        end;
        for i:=1 to NewPatLength do begin
          c^:=NewPattern[i];
          inc(c);
        end;
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
        inc(d,PatLength);
        PrevP:=P-1;
      end else begin
        for i:=PrevP+1 to Length(S) do begin
          c^:=d^;
          inc(c); inc(d);
        end;
      end;
    until p=0;
  end;
end;

Comments are welcome. It'd be nice to get this one or an even more improved one to the FPC source tree.
« Last Edit: October 15, 2014, 04:54:27 am by jarto »

Leledumbo

  • Hero Member
  • *****
  • Posts: 8746
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: StringReplace extremely slow with big strings
« Reply #1 on: October 14, 2014, 11:15:25 pm »
Quote
It'd be nice to get this one or an even more improved one to the FPC source tree.
Post it as a patch to the bugtracker. I'm not sure whether we have unit testing for rtl units, someone else might be able to confirm this and you can run your code against the unit tests. If it generates the same result as the original one, your code is likely to be accepted.

circular

  • Hero Member
  • *****
  • Posts: 4195
    • Personal webpage
Re: StringReplace extremely slow with big strings
« Reply #2 on: October 15, 2014, 12:14:34 am »
Hi,

I would propose to change this:
Code: [Select]
        for i:=1 to PatLength do
          Result[P+i-1]:=NewPattern[i];
Into
Code: [Select]
        move(NewPattern[1], Result[P], PatLength);
And
Code: [Select]
        for i:=1 to NewPatLength do begin
          c^:=NewPattern[i];
          inc(c);
        end;
Into
Code: [Select]
        Move(NewPattern[1],c^,NewPatLength);
        inc(c, NewPatLength);

Also I think that
Code: [Select]
if not (rfReplaceAll in Flags) then exit;Will not copy the end of the string

Also the copying of the end of the string can also be done with Move
« Last Edit: October 15, 2014, 12:17:00 am by circular »
Conscience is the debugger of the mind

jarto

  • Full Member
  • ***
  • Posts: 106
Re: StringReplace extremely slow with big strings
« Reply #3 on: October 15, 2014, 04:56:42 am »
circular, thank you for your suggestions and for noticing how the "replace one" -case didn't copy until the end of the string. Here's the updated code.

Before submitting to the bugtracker, I wonder if it's ok that the function uses posex?

Code: [Select]
Function NewStringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
  c,d: pchar;
begin
  PatLength:=Length(OldPattern);
  if PatLength=0 then begin
    Result:=S;
    exit;
  end;

  if rfIgnoreCase in Flags then begin
    Srch:=AnsiUpperCase(S);
    OldPat:=AnsiUpperCase(OldPattern);
  end else begin
    Srch:=S;
    OldPat:=OldPattern;
  end;

  PatLength:=Length(OldPat);
  if Length(NewPattern)=PatLength then begin
    //Result length will not change
    Result:=S;
    P:=1;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        move(NewPattern[1],Result[P],PatLength);
        if not (rfReplaceAll in Flags) then exit;
        inc(P,PatLength);
      end;
    until p=0;
  end else begin
    //Different pattern length -> Result length will change
    //To avoid creating a lot of temporary strings, we count how many
    //replacements we're going to make.
    P:=1; PatCount:=0;
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        inc(P,PatLength);
        inc(PatCount);
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    if PatCount=0 then begin
      Result:=S;
      exit;
    end;
    NewPatLength:=Length(NewPattern);
    SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
    P:=1; PrevP:=0;
    c:=pchar(Result); d:=pchar(S);
    repeat
      P:=PosEx(OldPat,Srch,P);
      if P>0 then begin
        Cnt:=P-PrevP-1;
        if Cnt>0 then begin
          Move(d^,c^,Cnt);
          inc(c,Cnt);
          inc(d,Cnt);
        end;
        if NewPatLength>0 then begin
          Move(NewPattern[1],c^,NewPatLength);
          inc(c,NewPatLength);
        end;
        inc(P,PatLength);
        inc(d,PatLength);
        PrevP:=P-1;
        if not (rfReplaceAll in Flags) then break;
      end;
    until p=0;
    Cnt:=Length(S)-PrevP;
    if Cnt>0 then Move(d^,c^,Cnt);
  end;
end;

minesadorada

  • Sr. Member
  • ****
  • Posts: 452
  • Retired
Re: StringReplace extremely slow with big strings
« Reply #4 on: October 15, 2014, 08:30:18 am »
@jarto: Is there also a UTF version (NewStringReplaceUTF8) and ANSI version (NewAnsiStringReplace) in the pipeline?  They could replace the ones in the fileutil unit.
GPL Apps: Health MonitorRetro Ski Run
OnlinePackageManager Components: LazAutoUpdate, LongTimer, PoweredBy, ScrollText, PlaySound, CryptINI

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: StringReplace extremely slow with big strings
« Reply #5 on: October 15, 2014, 08:44:46 am »
Before submitting to the bugtracker, I wonder if it's ok that the function uses posex?
If the unit posex is in is already referenced by/present in the unit stringreplace is in (sorry to lazy to look it up now) I don't see a problem?
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

jarto

  • Full Member
  • ***
  • Posts: 106
Re: StringReplace extremely slow with big strings
« Reply #6 on: October 15, 2014, 09:17:56 am »
@jarto: Is there also a UTF version (NewStringReplaceUTF8) and ANSI version (NewAnsiStringReplace) in the pipeline?  They could replace the ones in the fileutil unit.

I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever. Unless you call it with rfIgnoreCase, of course.

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: StringReplace extremely slow with big strings
« Reply #7 on: October 15, 2014, 09:35:16 am »
I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.
I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.

However, I'd focus first on the current version you have and see what the fpc devs have to say when you submit your patch...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: StringReplace extremely slow with big strings
« Reply #8 on: October 15, 2014, 01:59:07 pm »
BTW see http://bugs.freepascal.org/view.php?id=22501 where alternative stringreplace functions in conjunction with FPC trunk are discussed.
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

circular

  • Hero Member
  • *****
  • Posts: 4195
    • Personal webpage
Re: StringReplace extremely slow with big strings
« Reply #9 on: October 15, 2014, 05:16:55 pm »
I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.
I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.
In my view, if it is a case sensitive search, it works with UTF8 too, because each UTF8 character start with a byte that indicate its length and the next bytes are always in a certain range. So when providing an UTF8 search string and an UTF8 pattern, it is not possible to identify a pattern without matching the beginning and ending of characters.

However, this makes a difference when the search is not case sensitive. The resulting string will certainly be corrupted if the search pattern or search string contains non ASCII characters when the AnsiUpperCase is applied on an UTF8 string.

Also it will not work for sure with widestrings.
Conscience is the debugger of the mind

minesadorada

  • Sr. Member
  • ****
  • Posts: 452
  • Retired
Re: StringReplace extremely slow with big strings
« Reply #10 on: October 15, 2014, 05:33:24 pm »
In StrUtils,
AnsiReplaceText just calls StringReplace with the flagset [rfReplaceAll,rfIgnoreCase]

=edit: oops! I meant StrUtils==
« Last Edit: October 15, 2014, 06:11:26 pm by minesadorada »
GPL Apps: Health MonitorRetro Ski Run
OnlinePackageManager Components: LazAutoUpdate, LongTimer, PoweredBy, ScrollText, PlaySound, CryptINI

jarto

  • Full Member
  • ***
  • Posts: 106
Re: StringReplace extremely slow with big strings
« Reply #11 on: October 15, 2014, 05:54:05 pm »
Well, it's submitted. Let's see if it's accepted.

http://bugs.freepascal.org/view.php?id=26864

User137

  • Hero Member
  • *****
  • Posts: 1791
    • Nxpascal home
Re: StringReplace extremely slow with big strings
« Reply #12 on: October 15, 2014, 09:57:18 pm »
Would it be faster if you would save the search results in TList (faster than dynamic array for this case) from first pass? Just in theory, it could be more optimal. You would need to store the PosEx result, an integer.

jarto

  • Full Member
  • ***
  • Posts: 106
Re: StringReplace extremely slow with big strings
« Reply #13 on: October 16, 2014, 05:57:17 pm »
I'm not sure that StringReplace should be more complicated to the point of using a TList to store values. It would probably cause it to be slower with short strings while speeding up faster ones.

skalogryz

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2770
    • havefunsoft.com
Re: StringReplace extremely slow with big strings
« Reply #14 on: October 16, 2014, 06:53:57 pm »
you need more efficient substring search than PosEx().
Check this out.
Code: [Select]
uses Classes, SysUtils, StrUtils, dateutils;

...function NewStringReplace...

var
  t: TDateTime;
  src : string;
  i   : integer;
  tmp : string;
  l   : string;
begin
  tmp:='aaaaaaaab';
  l:=Copy(tmp,1,length(tmp)-1);
  src:='';
  for i:=0 to 500000 do src:=src+l;
  writeln('prepared!');
  src:=src+tmp;

   t:=now;
   StringReplace(src,tmp,'*',[rfReplaceAll]);
   writeln('rtl: ',MilliSecondsBetween(now,t));


   t:=now;
   NewStringReplace(src,tmp,'*',[rfReplaceAll]);
   writeln('new: ', MilliSecondsBetween(now,t));
end.
« Last Edit: October 16, 2014, 06:55:34 pm by skalogryz »

 

TinyPortal © 2005-2018