Recent

Author Topic: StringReplace extremely slow with big strings  (Read 32125 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 14205
  • Probably until I exterminate Putin.
Re: StringReplace extremely slow with big strings
« Reply #30 on: January 27, 2018, 09:27:19 am »
engkin given
Code: Pascal  [Select][+][-]
  1. { TODO : Consider using FindMatchesBoyerMooreCaseSensitive }
Note that in trunk the BoyerMoore routines are now functions that return true if there are matches or false otherwise.
May simplify using it. They have also been made faster recently.
Specialize a type, not a var.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #31 on: January 27, 2018, 01:20:01 pm »
Thank you Thaddy. I should use the trunk version then.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #32 on: January 28, 2018, 07:47:47 pm »
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Lazarus IDE (which uses UTF8UpperCase) does follow this rule and changes ß to SS.

Similar ligatures that should expand to a few characters:
ff fi fl ffi ffl ſt st (U+FB00 to U+FB06)

UTF8UpperCase (Not sure if by design or a bug?) does not expand them. But it does expand ß.

I am not aware of any codepoint that gives more than one codepoint for its lower case.

PascalDragon

  • Hero Member
  • *****
  • Posts: 5446
  • Compiler Developer
Re: StringReplace extremely slow with big strings
« Reply #33 on: January 28, 2018, 08:46:51 pm »
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS ;) (more info here)

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #34 on: January 29, 2018, 06:37:24 am »
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.

Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS ;) (more info here)
Thank you! Unicode Case Charts still show that the upper case for ß is SS.

CM630

  • Hero Member
  • *****
  • Posts: 1082
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: StringReplace extremely slow with big strings
« Reply #35 on: January 29, 2018, 06:59:47 am »
Here is a possible replacement for UTF8StringReplace. It needs to be tested...
Tested:
Code: [Select]
TApplication.HandleException Range check error
  Stack trace:
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas
  $004306EF  TFRMMAIN__OPENXML,  line 1976 of frmMainSRC.pas
  $0043C837  TFRMMAIN__FORMDROPFILES,  line 3043 of frmMainSRC.pas
  $0041F0E8  TCUSTOMFORM__INTFDROPFILES,  line 2619 of ./include/customfor
  $00522AD7  TWINDOWPROCHELPER__HANDLEDROPFILES,  line 1107 of ./win32/win
  $00525136  TWINDOWPROCHELPER__DOWINDOWPROC,  line 2246 of ./win32/win32c
  $00525F3B  WINDOWPROC,  line 2657 of ./win32/win32callback.inc
  $005FE933  CUSTOMFORMWNDPROC,  line 386 of ./win32/win32wsforms.pp
  $76D062FA
  $76D06D3A
  $76D077C4
  $76D0788A
  $00526DAD  TWIN32WIDGETSET__APPPROCESSMESSAGES,  line 407 of ./win32/win
  $0042476D  TAPPLICATION__HANDLEMESSAGE,  line 1276 of ./include/applicat
  $00424B8E  TAPPLICATION__RUNLOOP,  line 1413 of ./include/application.in
  $00470E40  TWIDGETSET__APPRUN,  line 54 of ./include/interfacebase.inc
  $00424B4E  TAPPLICATION__RUN,  line 1401 of ./include/application.inc
This happens after several replacements, I will try to remove them one by one.
Unfortunately I cannot share my text files (since they are not mine).


Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS  (more info here)
Capital Eszett sound pretty much to me like a capital „ь‟ (small er)  :D
« Last Edit: January 29, 2018, 07:03:55 am by CM630 »
Лазар 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #36 on: January 29, 2018, 07:03:51 am »
Quote
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas

I assume you changed its name to STRINGREPLACENEW. Which line is 966?

CM630

  • Hero Member
  • *****
  • Posts: 1082
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: StringReplace extremely slow with big strings
« Reply #37 on: January 29, 2018, 07:05:10 am »
Quote
  $004670AA  STRINGREPLACENEW,  line 966 of CommonXMLr.pas

I assume you changed its name to STRINGREPLACENEW.

Precisely.


Which line is 966?
Exactly
Code: Pascal  [Select][+][-]
  1.       move(NewPattern[1], Result[i2], Length(NewPattern));  { Copy the new pattern }
Лазар 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #38 on: January 29, 2018, 07:09:23 am »
It seems that Result does not have the correct size. \

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #39 on: January 29, 2018, 07:14:12 am »
Or your text is not valid UTF8. Can you change UTF8CharacterLengthFast to UTF8CharacterLength and try again?

CM630

  • Hero Member
  • *****
  • Posts: 1082
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: StringReplace extremely slow with big strings
« Reply #40 on: January 29, 2018, 07:22:50 am »
Or your text is not valid UTF8. Can you change UTF8CharacterLengthFast to UTF8CharacterLength and try again?
Same behaviour. But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.

I wrapped Line 966 in try ... except writeln oldpattern... new pattern, and I got that this happens when I am executing this line:
    FileContents:=StringReplaceNew(FileContents, ' cellpadding="0" cellspacing="0"',    '', ReplaceParams);
I commented it out and now the app does not crash.
I have some 20 calls of the function StringReplaceNew before it and they seem to be done okay (at least do not cause a crash).

EDIT: I tried

Code: Pascal  [Select][+][-]
  1. procedure TForm1.Button1Click(Sender: TObject);
  2. var
  3.   FileContents: string;
  4. begin
  5.   FileContents:= 'алабала cellpadding="0" cellspacing="0" тралала' + LineEnding + 'алабала cellpadding="0" cellspacing="0" тралала' ;
  6.   FileContents:=StringReplaceNew(FileContents, ' cellpadding="0" cellspacing="0"',    '',[rfReplaceAll, rfIgnoreCase]);
  7.   ShowMessage (FileContents);
  8. end;
and it works fine.

Edit: In other files, which had no problems with the previous StringReplaceNew routine I started having troubles. Not only with cellpadding...
« Last Edit: January 29, 2018, 07:44:30 am by CM630 »
Лазар 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #41 on: January 29, 2018, 07:56:48 am »
But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.
Use FindInvalidUTF8Character from LazUTF8. Something like:
Code: Pascal  [Select][+][-]
  1.   LocationOfBadCode := FindInvalidUTF8Character(@FileContents[1], Length(FileContents), True);
  2.   If LocationOfBadCode<>-1 then
  3.      WriteLn('Invalid UTF8 string');

Edit:
If the text is valid, does the problem happen with and without rfIgnoreCase?
« Last Edit: January 29, 2018, 07:59:36 am by engkin »

CM630

  • Hero Member
  • *****
  • Posts: 1082
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: StringReplace extremely slow with big strings
« Reply #42 on: January 29, 2018, 08:20:52 am »
I did:


Code: Pascal  [Select][+][-]
  1. function StringReplaceNew(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. var
  3.   Matches: array of integer;
  4.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  5.   PatLength,P,Count, Capacity, i: Integer;
  6.   p1,p2, MemPos: pchar;
  7.   i1,i2, l: integer;
  8.   LocationOfBadCode: integer;
  9. begin
  10.   LocationOfBadCode := FindInvalidUTF8Character(@S[1], Length(S), True);
  11.   If LocationOfBadCode<>-1 then
  12.   begin
  13.     writeln('Invalid UTF8 string' + LineEnding
  14.        + OldPattern + LineEnding +  NewPattern);
  15.   end;  
  16. ...
and I have put a breakpoint on  writeln...
Everything seems to be valid. Removing   rfIgnoreCase did not change behaviour.

EDIT: I have insulated a troublesome string and executed it in a standalone app. Then replacements are done just fine!??!?
« Last Edit: January 29, 2018, 09:11:27 am by CM630 »
Лазар 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: StringReplace extremely slow with big strings
« Reply #43 on: January 29, 2018, 07:45:40 pm »
EDIT: I have insulated a troublesome string and executed it in a standalone app. Then replacements are done just fine!??!?[/font]

YES! You have range check activated in the main project, but not in the standalone app. That explains the exceptions with and without rfIgnoreCase.

Either way, I fell in the same bug that I was fixing. I assumed that all patterns have equal lengths, but they are not necessarily equal.

This is another version that should correct these problems:
Code: Pascal  [Select][+][-]
  1. function UTF8StringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
  2. type
  3.   TMatch=Record
  4.     Pos,Length: integer;
  5.   end;
  6.  
  7. var
  8.   Matches: array of TMatch;
  9.   OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
  10.   PatLength,P,Count, Capacity, i: Integer;
  11.   p1,p2, MemPos, p3: pchar;
  12.   i1,i2, l, j: integer;
  13.   OldPatLengths: integer;
  14. begin
  15.   if (Length(OldPattern)=0) or (Length(S)=0) then
  16.     exit(S);
  17.  
  18.   if rfIgnoreCase in Flags then begin
  19.     Srch := UTF8LowerCase(S);
  20.     OldPat := UTF8LowerCase(OldPattern);
  21.   end else begin
  22.     Srch := S;
  23.     OldPat := OldPattern;
  24.   end;
  25.  
  26.   { Find matches }
  27.   Count := 0;
  28.   Capacity := 10;
  29.   SetLength(Matches, Capacity);
  30.   P:=PosEx(OldPat, Srch, 1);
  31.   if P=0 then exit(s);
  32.   if rfReplaceAll in Flags then  { TODO : Consider using FindMatchesBoyerMooreCaseSensitive }
  33.     while p<>0 do begin
  34.       Matches[Count].pos := p;
  35.       Matches[Count].Length := Length(OldPat);
  36.       inc(Count);
  37.       if Count=Capacity then begin { Grow }
  38.         inc(Capacity, 10);
  39.         SetLength(Matches, Capacity);
  40.       end;
  41.       P:=PosEx(OldPat,Srch,P+1);
  42.     end
  43.   else begin
  44.       Matches[Count].pos := p;
  45.       Matches[Count].Length := Length(OldPat);
  46.       inc(Count);
  47.   end;
  48.  
  49.   if rfIgnoreCase in Flags then begin { Correct match positions and lengths }
  50.     OldPatLengths := 0;
  51.     p1 := @Srch[1]; p2 := @S[1];
  52.     for i := 0 to Count-1 do begin
  53.       MemPos := @Srch[Matches[i].pos];
  54.       while p1<MemPos do begin
  55.         inc(p1, UTF8CharacterLengthFast(p1));
  56.         inc(p2, UTF8CharacterLengthFast(p2));
  57.       end;
  58.       Matches[i].pos := p2-@S[1]+1;
  59.       Matches[i].Length:=0;
  60.  
  61.       { Get length of old pattern in S }
  62.       { Patterns do not necessarily have the same length with rfIgnoreCase }
  63.       { We assume they have the same number of codepoints }
  64.       p3 := p2;
  65.       for j := 1 to Length(OldPat) do
  66.       begin
  67.         inc(p1, UTF8CharacterLengthFast(p1));
  68.         inc(p3, UTF8CharacterLengthFast(p3));
  69.       end;
  70.       Matches[i].Length := p3-p2;
  71.       Inc(OldPatLengths, Matches[i].Length);
  72.       p2 := p3;
  73.     end;
  74.   end
  75.   else
  76.     OldPatLengths := Length(OldPat)*Count;
  77.  
  78.   PatLength := Length(OldPat);
  79.   if not (rfIgnoreCase in Flags) and (Length(NewPattern)=PatLength) then begin
  80.     //Result length will not change
  81.     Result:=S;
  82.     if (rfReplaceAll in Flags) then
  83.       for i := 0 to Count-1 do move(NewPattern[1], Result[Matches[i].pos], PatLength)
  84.     else
  85.       move(NewPattern[1], Result[Matches[0].pos], PatLength);
  86.   end else begin
  87.     SetLength(Result, Length(S) + Count*Length(NewPattern)-OldPatLengths);
  88.     if Length(Result)=0 then exit;
  89.     i1 := 1; i2 := 1;
  90.     for i := 0 to Count-1 do begin
  91.       l := Matches[i].pos-i1;
  92.       if l>0 then
  93.         move(S[i1], Result[i2], l); { Copy text before pattern }
  94.       inc(i2, l); { Move to the location of the pattern }
  95.       i1 := Matches[i].pos+Matches[i].Length; { Move over the old pattern }
  96.  
  97.       if Length(NewPattern)>0 then
  98.         move(NewPattern[1], Result[i2], Length(NewPattern));  { Copy the new pattern }
  99.       inc(i2, Length(NewPattern));
  100.     end;
  101.     if (i1>0) and (i1<=Length(S)) then
  102.       move(S[i1], Result[i2], Length(S)-i1+1); { Copy leftover text }
  103.   end;
  104. end;

It does need testing.
« Last Edit: January 29, 2018, 09:39:42 pm by engkin »

CM630

  • Hero Member
  • *****
  • Posts: 1082
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: StringReplace extremely slow with big strings
« Reply #44 on: January 30, 2018, 08:36:04 am »
Lots of thanks, so far it seems to work fine!!!
I will report immediately if I see it misbehaving. Or within 2 weeks if I see that everything is fine.
Лазар 3,2 32 bit (sometimes 64 bit); FPC3,2,2; rev: Lazarus_3_0 on Win10 64bit.

 

TinyPortal © 2005-2018