Recent

Author Topic: Case insensitive search and replace functions for strings.  (Read 27990 times)

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Case insensitive search and replace functions for strings.
« on: April 18, 2012, 02:47:15 pm »
How am I to make a case insensitive search in a string?
PosEx cannot do the job, so i did
PosEx (UTF8LowerCase  (string1),UTF8LowerCase  (string2), 1)
Maybe there is a better way?

Yet another question is how to replace a substring i a string.
AnsiReplaceText is case insensitve according to http://lazarus-ccr.sourceforge.net/docs/rtl/strutils/index-5.html , but that is not true, or if it is- it is only for some basic Latin letters.
Shall I do the lowercase thing again?
Лазар 4,4 32 bit (sometimes 64 bit); FPC3,2,2

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #1 on: April 18, 2012, 03:25:33 pm »
Not sure if they are any ready made solutions, so your ideas should work.
A couple of notes.

Quote
PosEx cannot do the job, so i did

If your dealing with UTF8 string, then I'd assume your best using UTF8Pos if the index is important.  As would be the case when doing the ReplaceText,  eg. you would use the index from UTF8Pos that could then go into UTFDelete & UTFInsert.

With my version of Lazarus, I'm not sure these Unicode functions are really efficient or even correct.  eg.  UTF8Pos, internally uses the normal system.pos and then uses UTF8Length to get the index.  I'd worry about code-point changes giving false positives.

If these haven't been updated in the latest Lazarus I'd be willing to have a go at making a more efficient versions, that would basically handle the code points manually, and possible handle unicode duplicates and also making it a single pass function.



KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #2 on: April 18, 2012, 03:48:37 pm »
Quote
I'd worry about code-point changes giving false positives.

Actually scrap that one, I think the people at Unicode thought of that :),  All sub-bytes are prefixed with binary 10.

But the single pass idea still stands, I'd have to do some performance tests to see if it makes a difference though.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #3 on: April 18, 2012, 11:54:48 pm »
Yet another question is how to replace a substring i a string.

I've made a simple UTF8StringReplace that's similar to StringReplace that I think will do what you want.

Code: [Select]
function UTF8StringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  uS,uOld:string;
begin
  if rfIgnoreCase in Flags then
  begin
    uS := UTF8UpperCase(S);
    uOld := UTF8UpperCase(OldPattern);
    Flags := Flags - [rfIgnoreCase]; //no point uppercasing again
    result := StringReplace(uS,uOld,NewPattern,Flags);
  end else result := StringReplace(S,OldPattern,NewPattern,Flags);
end; 

example usage:
Code: [Select]
eTest.Text := UTF8StringReplace(eTest.Text,'BJÖRN','舒淇',[rfIgnoreCase,rfReplaceAll]);


I've also created a UTF8Pos that on my machine is about 40% faster.  I've also created an overload version for getting the BytePos, and setting the startfrom(bytepos), (if setting the startfrom, returned char pos is relative to this.).

Code: [Select]
function UTF8Pos(const SearchForText, SearchInText: string; out bytepos:PtrInt;startfrom:PtrInt=1): PtrInt; overload;
var
  //lets use pByte instead of pChar avoid ord() or casting
  pLast,pSearch,pSearchEnd,pText:pByte;
  lenSearch,lenText,charPos:PtrInt;
  cwSearch,cwText:integer;
  //
  function CL(p:pbyte):integer; inline;
  var
    t0,t1,t2,t3:byte;
  begin
    //check for 1 char len
    t0 := p[0];
    if t0 < %11000000 then begin result := 1; exit; end;
    //check for 2 char len
    t1 := p[1];
    if t0 and %11100000 = %11000000 then
    begin
      if t1 and %11000000 = %10000000 then result := 2
      else result := 1;
      exit;
    end;
    //check for 3 char len
    t2 := p[2];
    if t0 and %11110000 = %11100000 then
    begin
      if (t1 and %11000000 = %10000000) and
         (t2 and %11000000 = %10000000) then result := 3
      else result := 1;
      exit;
    end;
    //check for 4 char len
    t3 := p[3];
    if t0 and %11111000 = %11110000 then
    begin
      if (t1 and %11000000 = %10000000) and
         (t2 and %11000000 = %10000000) and
         (t3 and %11000000 = %10000000) then result := 4
      else result := 1;
      exit;
    end;
    result := 1; //should not get here.
  end;
  //
  function AllEqual:boolean; inline;
  var
    cwSearch2,cwText2:integer;
    pSearch2,pText2:pByte;
  begin
    pSearch2 := pSearch + cwSearch;
    pText2 := pText + cwText;
    while pSearch2 <= pSearchEnd do
    begin
      cwSearch2 := CL(pSearch2);
      cwText2 := CL(pText2);
      if cwSearch2 <> cwText2 then exit;
      if not comparemem(pSearch2,pText2,cwText2) then
      begin
        result := false;
        exit;
      end;
      inc(pSearch2,cwSearch2);
      inc(pText2,cwText2);
    end;
    result := true;
  end;
  //
begin
  result := 0;
  lenSearch := length(SearchForText);
  lenText := length(SearchInText);
  if (lenSearch=0) or (lenSearch>lenText) then exit;
  pText  :=@SearchInText[startFrom];
  pSearch:=@SearchForText[1];
  pSearchEnd:=@SearchForText[lenSearch];
  charPos := 0; bytepos := 0;
  pLast := @pText[1+lenText-lenSearch];
  cwSearch := CL(pSearch);
  while pText < pLast do
  begin
    cwText := CL(pText);
    if (cwSearch=cwText) then
      if compareMem(pSearch,pText,cwText) then
        if AllEqual then
        begin
          result := charPos+1;
          bytepos := ptrInt(pText)-ptrInt(@SearchInText[1])+1;
          exit;
        end;
    inc(pText,cwText); //move to next char
    inc(charPos);
  end;
  bytePos := 0;
end;

function UTF8Pos(const SearchForText, SearchInText: string): PtrInt; overload;
var
  dum:PtrInt;
begin
  result := UTF8Pos(SearchForText, SearchInText, dum);
end;   

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #4 on: April 19, 2012, 09:11:10 am »
Thanks for the help.
I have already done:
Code: [Select]
NewString:=AnsiReplaceText (String1,UTF8LowerCase(SearchIgnoreChars[i]),'');Maybe
Code: [Select]
NewString:=AnsiReplaceStr (string1,UTF8LowerCase(string2),string3); should work faster, at least I suppose that AnsiReplaceText is slower than AnsiReplaceStr.

Oops, in my case string3 was ''' (empty), obviously this code won't do for other cases.
Also, your code does not work properly, too, because the result string is uppercase.
A solution (probably extremely slow) is:
Code: [Select]
function UTF8StringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  StartPosition: integer;
begin
     StartPosition:=PosEx (UTF8LowerCase(OldPattern),UTF8LowerCase  (S),1);
     Result:= LeftStr(S,StartPosition-1)+ NewPattern+ RightStr(s,Length  (s)-StartPosition-Length(OldPattern)+1);
end;
It will replace the first occurrence only, so some cycling is needed.
Лазар 4,4 32 bit (sometimes 64 bit); FPC3,2,2

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Case insensitive search and replace functions for strings.
« Reply #5 on: April 19, 2012, 10:01:23 am »
I've made a simple UTF8StringReplace that's similar to StringReplace that I think will do what you want.
...
I've also created a UTF8Pos that on my machine is about 40% faster.  I've also created an overload version for getting the BytePos, and setting the startfrom(bytepos), (if setting the startfrom, returned char pos is relative to this.).

KpjComp, for my own selfish reasons: if you think these functions are better than existing FPC/Lazarus functions or fill a functionality hole, could you please post them as patches to the bug tracker? Hopefully the core devs will look at them and implement them.
Note: I'm still trying to stay away from Unicode a bit, hoping FPC 2.7+ and Lazarus will be able to work more seamlessly together... who knows, a man can dream... so perhaps these functions are unsuitable/incompatible. In that case, please ignore this ;)

Thanks,
BigChimp
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #6 on: April 19, 2012, 10:18:21 am »
because the result string is uppercase.

Oops, yes. 

I was starting to write a StringReplace that used the utf8Pos and for some reason thought of this.  It was late!!  :-[

I have another idea, now I'm awake I could try, Create a uf8Pos that's also case-insensitive.

Another way, is also doing an uppercase on the search and another uppercase on the Text, but only for getting the positions.  The only problem here is I've a feeling uppercasing a utf8char could potentially alter it's char-width.  So all position calculations I believe would would need to be in Char pos, rather than byte pos causing linear search for each one to get the byte-pos. (UTF8CharStart).

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #7 on: April 19, 2012, 11:30:43 am »
And it came as no surprise, that there is a BUG in the Lazarus IDE (0.9.30, FPC 2.4.2)- searching in the Source editor is always case sensitive (for non-latin letters). Where is the Lazarus chieftain?

About altering charwidth- if you work with strings, there is no problem. Indeed in UTF8 basic Latin chars are 1 byte long, while the others are two-bytes, but length() returns the number of real letters.

I did a check. With
Code: [Select]
function UTF8StringReplace(const S, OldPattern, NewPattern: string{;  Flags: TReplaceFlags}): string;
var
  StartPosition: integer;
begin
     StartPosition:=PosEx (UTF8LowerCase(OldPattern),UTF8LowerCase  (S),1);
     if StartPosition= 0 then
       Result:= S
     else
       Result:= LeftStr(S,StartPosition-1)+ NewPattern+ RightStr(s,Length  (s)-StartPosition-Length(OldPattern)+1);
end;
running this
Code: [Select]
ShowMessage ('wbJdГЖщ' +crlf+  'JdГ'+ crlf+ 'FgQ' + crlf+ UTF8StringReplace('wbJdгЖщ', 'JdГ','FgQ'));shows exactly what it should. Considering that we have mixed basic Latin + Cyrillic and mixed lowercase+uppercase, it is okay.

crlf is #13 + #10

« Last Edit: April 19, 2012, 11:33:26 am by paskal »
Лазар 4,4 32 bit (sometimes 64 bit); FPC3,2,2

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Case insensitive search and replace functions for strings.
« Reply #8 on: April 19, 2012, 12:18:11 pm »
And it came as no surprise, that there is a BUG in the Lazarus IDE (0.9.30, FPC 2.4.2)- searching in the Source editor is always case sensitive (for non-latin letters). Where is the Lazarus chieftain?
He's in the bug tracker, but he'll probably tell you to upgrade to SVN trunk (or if he's in a good mood, a recent snapshot) and see if the bug is still there. Who still plays with 0.9.30/2.4.2 when you can use 0.9.31/2.6.0?

(More seriously... a lot of bugs have been fixed. You might want to check into newever versions if possible).
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

theo

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1933
Re: Case insensitive search and replace functions for strings.
« Reply #9 on: April 19, 2012, 12:34:56 pm »
It works in Lazarus 1.1 r36728M FPC 2.7.1 x86_64-linux-gtk 2

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #10 on: April 19, 2012, 01:29:46 pm »
About altering charwidth- if you work with strings, there is no problem.

try this with your code->
Code: [Select]
utf8stringReplace('abcdefghijklmn','İ','X');

The answer of course is "abcdefghXjklmn", but instead you will get "abcdefghXklmn"
IOW: it's dropped the 'j' because 'İ' and 'i' have different char-widths.

I have an idea for a fast utf8StringReplace that will avoid these problems, and very few pass scans for performance.  I'll post here once done.  ps: It won't be using the utf8pos, the solution is even easier than that.  :)


CM630

  • Hero Member
  • *****
  • Posts: 1641
  • Не съм сигурен, че те разбирам.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #11 on: April 19, 2012, 02:43:38 pm »
KpjComp, you are right, it skips the j  >:D
My tests was not proper, since it did not consider the order between 2-byte and 1-byte strings.

It works in Lazarus 1.1 r36728M FPC 2.7.1 x86_64-linux-gtk 2
Indeed, it works with Lazarus-0.9.31-36901-fpc-2.6.0-20120419.

In http://lazarus-ccr.sourceforge.net/docs/lcl/lclproc/index-5.html there is Deprecated, see #LazUtils.LazUTF8 for replacements. for many of the UTF8 functions.
Unfortunately, Google did not help me find where is the documentation for #LazUtils.LazUTF8 ?
« Last Edit: April 19, 2012, 03:22:18 pm by paskal »
Лазар 4,4 32 bit (sometimes 64 bit); FPC3,2,2

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Case insensitive search and replace functions for strings.
« Reply #12 on: April 19, 2012, 05:26:33 pm »
Quote
Unfortunately, Google did not help me find where is the documentation for #LazUtils.LazUTF8 ?
As usual: none but the source itself (well... the fpdoc xml exists, but it seems only 1-2 functions are documented). You can generate (almost empty) the documentation by yourself. There's a build_lazutils_html.sh script under docs/html directory which you could execute.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #13 on: April 19, 2012, 10:05:46 pm »
KpjComp, for my own selfish reasons: if you think these functions are better than existing FPC/Lazarus functions or fill a functionality hole, could you please post them as patches to the bug tracker? Hopefully the core devs will look at them and implement them.

That's an idea I will.

Might be best if there are people here who can give these functions a good testing first.
Hopefully paskal will double check, or if anybody else spots any bugs let us know.
Also if there are any other Unicode utils like this that might be a good idea, I might be able to have a look.

This way I can bunch them all up into one bug tracker.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #14 on: April 19, 2012, 10:17:42 pm »
Ok,

Here is a UTF8StringReplace that will cope with variant char-width's correctly.

example:
Code: [Select]
  edit1.text := utf8stringReplace('ii_abcDEfghIIjklmn_iI','İi','舒淇',[rfignorecase,rfReplaceAll]);

result:
Code: [Select]
舒淇_abcDEfgh舒淇jklmn_舒淇


Code: [Select]
type TUTF8Indexed = record
  charWidths:array of byte;
  rawData:array of DWord;
end;

function UTF8ToUTF8Indexed(const Src:string):TUTF8Indexed;
var
  charLen:PtrInt;
  p,pe:pchar;
  cw,i:integer;
begin
  charLen := UTF8Length(Src);
  setlength(result.charWidths,charLen);
  setlength(result.rawData,charlen);
  //fill our raw data with 0's, so that compares will work when
  //out UF8, have different sizes.
  FillByte(result.rawData[0],sizeof(DWord),0);
  p := @Src[1];
  pe := @Src[length(src)];
  i:=0;
  while p <= pe do
  begin
    cw := UTF8CharacterLength(p);
    result.charwidths[i] := cw;
    move(p^,result.rawData[i],cw);
    inc(p,cw);
    inc(i);
  end;
end;

function UTF8IndexedToUTF8(const Src:TUTF8Indexed):string;
var
  uPos,iPos,len,cc:PtrInt;
begin
  //first set our result to maximum size possible,
  cc := length(Src.rawData);
  setlength(result,4*cc);
  uPos := 0;
  for iPos := 0 to cc-1 do
  begin
    move(src.rawData[iPos],result[upos+1],src.charWidths[iPos]);
    inc(upos,src.charWidths[ipos]);
  end;
  setlength(result,uPos);
end;

function UTF8StringReplace(const S, OldPattern, NewPattern: string;  Flags: TReplaceFlags): string;
var
  osrc,src,search:TUTF8Indexed;
  outpos,lpTextLen,lpSearchLen,lpTextEnd,lpTextPos:PtrInt;
const
  //constanst for string expansion, basically lets start at 128 bytes, then
  //increase double each time and max out at 32K.
  SizeResultStart=128;
  SizeResultMaxInc=32*1024;
  //
  procedure NeedSize(s:PtrInt);
  var
    cl,ns:PtrInt;
  begin
    //use our constants to control string expansion
    cl := length(result);
    if cl>=s then exit;
    ns := cl;
    if ns > SizeResultMaxInc then ns := SizeResultMaxInc;
    setlength(result,cl+ns);
  end;
  //
  procedure CopyNextChar;
  begin
    NeedSize(outpos+src.charWidths[lpTextPos]);
    move(osrc.rawData[lpTextPos],result[outpos+1],osrc.charWidths[lpTextPos]);
    inc(outpos);
    inc(lpTextPos);
  end;
  //
  function AllEqual:boolean;
  var
    lp:PtrInt;
  begin
    result := false;
    for lp := 1 to length(search.rawData)-1 do
    begin
      if search.RawData[lp] <> src.rawData[lpTextPos+lp] then exit;
    end;
    result := true;
  end;
  //
begin
  if rfIgnoreCase in Flags then begin
    //this is our ignore case, so lets lowercase search & text
    osrc := UTF8ToUTF8Indexed(S);
    src := UTF8ToUTF8Indexed(UTF8LowerCase(S));
    //osrc and src should be same size, lets double check
    if length(osrc.rawData)<>length(src.rawData) then
      raise Exception.Create('Lengths not equal');
    search := UTF8ToUTF8Indexed(UTF8LowerCase(OldPattern));
    //if there is nothing in our search, then there is nothing to replace
    if length(search.rawData) < 1 then begin
      result := S;
      exit;
    end;
    //ok mow have src & search in lowercase so our finds will work
    setlength(result,SizeResultStart);
    outpos := 0;
    lpTextLen := length(src.charWidths);
    lpSearchLen := length(search.charWidths);
    lpTextEnd := lpTextLen-lpSearchLen+1;
    lpTextPos := 0;
    //let's find our first char match
    while lpTextPos < lpTextEnd do
    begin
      if src.rawData[lpTextPos]=search.rawData[0] then
      begin
        //we have a match, are the rest equal.
        if AllEqual then
        begin
          if length(newPattern)>0 then
          begin
            NeedSize(outpos+length(newPattern));
            move(newPattern[1],result[outpos+1],length(newPattern));
            inc(outpos,length(newPattern));
          end;
          inc(lpTextPos,length(search.rawData));
          if rfReplaceAll in Flags then continue
          else break;
        end;
      end;
      //if we get here, there was no match, just copy char to output
      CopyNextChar;
    end;
    //grab any remaining..
    while lpTextPos < lpTextLen do CopyNextChar;
    //resize result, to correct size.
    setlength(result,outpos);
  end else begin
    //if were not bothered about case, the standard StringReplace should work.
    result := StringReplace(S,OldPattern,NewPattern,Flags);
  end;
end;

 

TinyPortal © 2005-2018