Function NewStringReplace(const S, OldPattern, NewPattern: string; Flags: TReplaceFlags): string;
var
OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
PatLength,NewPatLength,P,i,PatCount,PrevP: Integer;
c,d: pchar;
begin
PatLength:=Length(OldPattern);
if PatLength=0 then begin
Result:=S;
exit;
end;
if rfIgnoreCase in Flags then begin
Srch:=AnsiUpperCase(S);
OldPat:=AnsiUpperCase(OldPattern);
end else begin
Srch:=S;
OldPat:=OldPattern;
end;
PatLength:=Length(OldPat);
if Length(NewPattern)=PatLength then begin
//Result length will not change
Result:=S;
P:=1;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
for i:=1 to PatLength do
Result[P+i-1]:=NewPattern[i];
if not (rfReplaceAll in Flags) then exit;
inc(P,PatLength);
end;
until p=0;
end else begin
//Different pattern length -> Result length will change
//To avoid creating a lot of temporary strings, we count how many
//replacements we're going to make.
P:=1; PatCount:=0;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
inc(P,PatLength);
inc(PatCount);
if not (rfReplaceAll in Flags) then break;
end;
until p=0;
if PatCount=0 then begin
Result:=S;
exit;
end;
NewPatLength:=Length(NewPattern);
SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
P:=1; PrevP:=0;
c:=pchar(Result); d:=pchar(S);
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
for i:=PrevP+1 to P-1 do begin
c^:=d^;
inc(c); inc(d);
end;
for i:=1 to NewPatLength do begin
c^:=NewPattern[i];
inc(c);
end;
if not (rfReplaceAll in Flags) then exit;
inc(P,PatLength);
inc(d,PatLength);
PrevP:=P-1;
end else begin
for i:=PrevP+1 to Length(S) do begin
c^:=d^;
inc(c); inc(d);
end;
end;
until p=0;
end;
end;
It'd be nice to get this one or an even more improved one to the FPC source tree.Post it as a patch to the bugtracker. I'm not sure whether we have unit testing for rtl units, someone else might be able to confirm this and you can run your code against the unit tests. If it generates the same result as the original one, your code is likely to be accepted.
for i:=1 to PatLength do
Result[P+i-1]:=NewPattern[i];
Into move(NewPattern[1], Result[P], PatLength);
for i:=1 to NewPatLength do begin
c^:=NewPattern[i];
inc(c);
end;
Into Move(NewPattern[1],c^,NewPatLength);
inc(c, NewPatLength);
if not (rfReplaceAll in Flags) then exit;
Will not copy the end of the stringFunction NewStringReplace(const S, OldPattern, NewPattern: string; Flags: TReplaceFlags): string;
var
OldPat,Srch: string; // Srch and Oldp can contain uppercase versions of S,OldPattern
PatLength,NewPatLength,P,Cnt,PatCount,PrevP: Integer;
c,d: pchar;
begin
PatLength:=Length(OldPattern);
if PatLength=0 then begin
Result:=S;
exit;
end;
if rfIgnoreCase in Flags then begin
Srch:=AnsiUpperCase(S);
OldPat:=AnsiUpperCase(OldPattern);
end else begin
Srch:=S;
OldPat:=OldPattern;
end;
PatLength:=Length(OldPat);
if Length(NewPattern)=PatLength then begin
//Result length will not change
Result:=S;
P:=1;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
move(NewPattern[1],Result[P],PatLength);
if not (rfReplaceAll in Flags) then exit;
inc(P,PatLength);
end;
until p=0;
end else begin
//Different pattern length -> Result length will change
//To avoid creating a lot of temporary strings, we count how many
//replacements we're going to make.
P:=1; PatCount:=0;
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
inc(P,PatLength);
inc(PatCount);
if not (rfReplaceAll in Flags) then break;
end;
until p=0;
if PatCount=0 then begin
Result:=S;
exit;
end;
NewPatLength:=Length(NewPattern);
SetLength(Result,Length(S)+PatCount*(NewPatLength-PatLength));
P:=1; PrevP:=0;
c:=pchar(Result); d:=pchar(S);
repeat
P:=PosEx(OldPat,Srch,P);
if P>0 then begin
Cnt:=P-PrevP-1;
if Cnt>0 then begin
Move(d^,c^,Cnt);
inc(c,Cnt);
inc(d,Cnt);
end;
if NewPatLength>0 then begin
Move(NewPattern[1],c^,NewPatLength);
inc(c,NewPatLength);
end;
inc(P,PatLength);
inc(d,PatLength);
PrevP:=P-1;
if not (rfReplaceAll in Flags) then break;
end;
until p=0;
Cnt:=Length(S)-PrevP;
if Cnt>0 then Move(d^,c^,Cnt);
end;
end;
Before submitting to the bugtracker, I wonder if it's ok that the function uses posex?If the unit posex is in is already referenced by/present in the unit stringreplace is in (sorry to lazy to look it up now) I don't see a problem?
@jarto: Is there also a UTF version (NewStringReplaceUTF8) and ANSI version (NewAnsiStringReplace) in the pipeline? They could replace the ones in the fileutil unit.
I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.
In my view, if it is a case sensitive search, it works with UTF8 too, because each UTF8 character start with a byte that indicate its length and the next bytes are always in a certain range. So when providing an UTF8 search string and an UTF8 pattern, it is not possible to identify a pattern without matching the beginning and ending of characters.I'm not familiar enough with what's exactly needed, but basically that one works with any data. It can be UTF8, binary data, whatever.I suspect not. UTF8 e.g. often has 1 byte per character but can have multiples. That needs to be taken into account.
uses Classes, SysUtils, StrUtils, dateutils;
...function NewStringReplace...
var
t: TDateTime;
src : string;
i : integer;
tmp : string;
l : string;
begin
tmp:='aaaaaaaab';
l:=Copy(tmp,1,length(tmp)-1);
src:='';
for i:=0 to 500000 do src:=src+l;
writeln('prepared!');
src:=src+tmp;
t:=now;
StringReplace(src,tmp,'*',[rfReplaceAll]);
writeln('rtl: ',MilliSecondsBetween(now,t));
t:=now;
NewStringReplace(src,tmp,'*',[rfReplaceAll]);
writeln('new: ', MilliSecondsBetween(now,t));
end.
you need more efficient substring search than PosEx().
If the unit posex is in is already referenced by/present in the unit stringreplace is in (sorry to lazy to look it up now) I don't see a problem?New StringReplace use PosEx, which in StrUtils, which uses SysUtils, which contain StringReplace.
If you replace ansi with ansi, plz do NOT use string in Lazarus: that is UTF8. Rename string to AnsiString.Thaddy, in Lazarus by default String = AnsiString.
2. CodeUnit is some type of variable. But I could no find more info about it.No. It is a concept of Unicode. It is the smallest atomic building block. A code point can consist of one or more code units.
3. My current (old) implementation of StringReplaceNew messes the strings up.
I believe you have a problem with rfIgnoreCase. Your code assumes that Srch and S have the same length. Upper case and lower case do *not* necessarily have the same length.True. It also means functions UTF8SwapCase and my recently added UTF8ProperCase in unit LazUTF8 are buggy.
UTF8StringReplace from unit LazUTF8 has the same bug. It uses UTF8LowerCase instead.Ok, I didn't realize that one. The function has been there forever.
You did not show the current implementation. I assume something like:I did in post 20 (first function). Or you mean sth. else?
I think your belief is right! I removed all rfIgnoreCase and now I do not notice any corruption of the string (but I cannot say for sure that there is not any).
I believe you have a problem with rfIgnoreCase. Your code assumes that Srch and S have the same length. Upper case and lower case do *not* necessarily have the same length. Check the source code for UTF8UpperCase:
On the positive side the lengths are the same in most cases but that is no excuse for a bug of course.
Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.
Thank you! Unicode Case Charts (http://www.unicode.org/charts/case/index.html) still show that the upper case for ß is SS.Using upper case causes another bug. Some code points produce more than one character. For instance the upper case for German ligature sharp s (U+00DF ß) is SS.
Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS ;) (more info here (https://en.wikipedia.org/wiki/Capital_%E1%BA%9E))
Here is a possible replacement for UTF8StringReplace. It needs to be tested...Tested:
TApplication.HandleException Range check error
Stack trace:
$004670AA STRINGREPLACENEW, line 966 of CommonXMLr.pas
$004306EF TFRMMAIN__OPENXML, line 1976 of frmMainSRC.pas
$0043C837 TFRMMAIN__FORMDROPFILES, line 3043 of frmMainSRC.pas
$0041F0E8 TCUSTOMFORM__INTFDROPFILES, line 2619 of ./include/customfor
$00522AD7 TWINDOWPROCHELPER__HANDLEDROPFILES, line 1107 of ./win32/win
$00525136 TWINDOWPROCHELPER__DOWINDOWPROC, line 2246 of ./win32/win32c
$00525F3B WINDOWPROC, line 2657 of ./win32/win32callback.inc
$005FE933 CUSTOMFORMWNDPROC, line 386 of ./win32/win32wsforms.pp
$76D062FA
$76D06D3A
$76D077C4
$76D0788A
$00526DAD TWIN32WIDGETSET__APPPROCESSMESSAGES, line 407 of ./win32/win
$0042476D TAPPLICATION__HANDLEMESSAGE, line 1276 of ./include/applicat
$00424B8E TAPPLICATION__RUNLOOP, line 1413 of ./include/application.in
$00470E40 TWIDGETSET__APPRUN, line 54 of ./include/interfacebase.inc
$00424B4E TAPPLICATION__RUN, line 1401 of ./include/application.inc
This happens after several replacements, I will try to remove them one by one.Well, since June 2017 it's officially allowed to use the Capital Sharp S (ẞ, U+1E9E) in German instead of SS (more info here)Capital Eszett sound pretty much to me like a capital „ь‟ (small er) :D
$004670AA STRINGREPLACENEW, line 966 of CommonXMLr.pas
Quote$004670AA STRINGREPLACENEW, line 966 of CommonXMLr.pas
I assume you changed its name to STRINGREPLACENEW.
Which line is 966?Exactly
Or your text is not valid UTF8. Can you change UTF8CharacterLengthFast to UTF8CharacterLength and try again?Same behaviour. But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.
But I have no idea if my text is valid UTF8, I think it is, but I have no idea how to detect it.Use FindInvalidUTF8Character from LazUTF8. Something like:
EDIT: I have insulated a troublesome string and executed it in a standalone app. Then replacements are done just fine!??!?[/font]
...Or within 2 weeks if I see that everything is fine.I as have promised to provide feedback- so far I have no issues with this code.