
Author Topic: Case insensitive search and replace functions for strings.  (Read 27984 times)

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #15 on: April 20, 2012, 08:21:04 am »
Both my tests and yours work this way; still, this replaces only a single occurrence of the string:

Code:
function UTF8StringReplace(const S, OldPattern, NewPattern: string{;  Flags: TReplaceFlags}): string;
var
  StringFull: UTF16String;
  StartPosition: integer;
begin
     StartPosition:=PosEx (UTF8LowerCase(OldPattern),UTF8LowerCase  (s),1);
     if StartPosition= 0 then
       Result:= s
     else
     begin
       StringFull:= UTF8ToUTF16 (s) ;

       //For some reason it works this way:
       Result:= UTF16ToUTF8( LeftStr(StringFull,StartPosition-1)+ NewPattern+ RightStr(StringFull, UTF8Length (StringFull)-StartPosition-UTF8Length(OldPattern)+1));

       //It does not work this way:
       //Result:= UTF16ToUTF8(LeftStr(StringFull,StartPosition-1)+ NewPattern+ RightStr(StringFull, UTF16Length (StringFull)-StartPosition-UTF16Length(OldPattern)+1));

       //And it does not work this way either. If UTF8ToUTF16 is done before PosEx, PosEx can't find the sought string, probably because there is no UTF16LowerCase function.
       //Result:= UTF16ToUTF8(LeftStr(StringFull,StartPosition-1)+ NewPattern+ RightStr(StringFull, UTF16Length (StringFull)-StartPosition-UTF16Length(OldPattern)+1));
     end;
end;

Leledumbo, do you mean that I have to download the Lazarus sources and run
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/docs/html/build_lazutils_html.sh?root=lazarus&view=log under Linux?
« Last Edit: April 20, 2012, 08:44:59 am by paskal »
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #16 on: April 20, 2012, 11:40:34 am »
Quote
Both my tests and yours work this way; still, this replaces only a single occurrence of the string:

Not really.

A couple of points on your code.

1. You're assuming UTF-16 can hold every character in one code unit. Again, a Unicode character can be made up of 2 UTF-16 code units (a surrogate pair). The only Unicode encoding guaranteed to use a single code unit per code point is UTF-32.
2. You're mixing UTF-8 and UTF-16 on the same line.
3. It doesn't work.
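
Points 1 and 2 can be checked with a small Free Pascal sketch. It assumes FPC 3.x, where UTF8Encode accepts a UnicodeString; the program and variable names are made up for illustration. It shows that one character outside the Basic Multilingual Plane occupies two UTF-16 code units, and that the same character sits at a different position depending on whether you count UTF-8 bytes or UTF-16 code units. That is why a PosEx result taken from a UTF-8 string cannot be used directly as an index into a UTF-16 string.

```pascal
program EncodingOffsets;
{$mode objfpc}{$H+}

var
  Clef, Wide: UnicodeString;
  Bytes: UTF8String;
begin
  // U+1D11E (musical G clef) lies outside the BMP, so UTF-16 needs a
  // surrogate pair for it: $D834 followed by $DD1E.
  SetLength(Clef, 2);
  Clef[1] := WideChar($D834);
  Clef[2] := WideChar($DD1E);
  WriteLn('UTF-16 code units for one character: ', Length(Clef)); // 2

  // Cyrillic U+041B followed by 'x': 2 UTF-16 code units, 3 UTF-8 bytes.
  SetLength(Wide, 2);
  Wide[1] := WideChar($041B);
  Wide[2] := WideChar(Ord('x'));
  Bytes := UTF8Encode(Wide);

  // The same 'x' is found at index 2 in UTF-16 but at byte 3 in UTF-8.
  WriteLn('Position of x in UTF-16 code units: ', Pos('x', Wide));
  WriteLn('Position of x in UTF-8 bytes: ', Pos('x', Bytes));
end.
```

The more non-ASCII characters precede the match, the further the two positions drift apart, so the replacement lands in the wrong place.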

Try this with your code.
Code:
edit1.text := utf8stringReplace('ii_abcDEfghIIjklmn_iI','İi','舒淇');

result->
舒淇_abcDEfghIIjklmn_iI

To make it work better, you could of course wrap your NewPattern in UTF8ToUTF16, but you would still have the potential problem of characters that take two UTF-16 code units. I'm not sure whether any such character changes its number of code units when its case is changed, so it might be OK.

Also you might have noticed my code already handles the rfReplaceAll flag.

There are a couple of things I could do to make my function better.

The amount of memory required is potentially 10x the size of the src string, though I think you'd need a pretty large Unicode string for this to be a problem. To reduce memory usage I could implement a ring buffer, a bit like the http://wiki.lazarus.freepascal.org/Rosetta_Stone, or alternatively just keep a couple more pointers running. Both of these options would of course complicate the code; currently I believe my code is very easy to follow.

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #17 on: April 23, 2012, 08:25:41 am »
Quote
1. You're assuming UTF-16 can hold every character in one code unit. Again, a Unicode character can be made up of 2 UTF-16 code units (a surrogate pair). The only Unicode encoding guaranteed to use a single code unit per code point is UTF-32.

Quite so. I thought that UTF-16 always used 2 bytes per character, but it turned out that a character may take up to 4.

Quote
2. You're mixing UTF-8 and UTF-16 on the same line.

Yes, this was the only one of the three options that seemingly gave a proper result.

Quote
3. It doesn't work.

That's a point.

Quote
Try this with your code.
Code:
edit1.text := utf8stringReplace('ii_abcDEfghIIjklmn_iI','İi','舒淇');

result->
舒淇_abcDEfghIIjklmn_iI

For me the result looks quite different, but since I insert two squares, I should get two squares.
The odd thing is that I get the improper chars at the proper place.
BTW, I tried ShowMessage (utf16toutf8(UTF8ToUTF16('舒淇'))); and it works fine (Well, assuming that the squares displayed in the Lazarus IDE are what they should be).

Quote
To make it work better, you could of course wrap your NewPattern in UTF8ToUTF16, but you would still have the potential problem of characters that take two UTF-16 code units. I'm not sure whether any such character changes its number of code units when its case is changed, so it might be OK.

You are absolutely right here. Now all 3 examples work... or at least they seem to, since the hieroglyphs are squares for me.


Code:
function UTF8StringReplace(const S, OldPattern, NewPattern: string{;  Flags: TReplaceFlags}): string;
var
  StringFull: UTF16String;
  StartPosition: integer;
begin
     StartPosition:=PosEx (UTF8LowerCase(OldPattern),UTF8LowerCase  (s),1);
     if StartPosition= 0 then
       Result:= s
     else
     begin
       StringFull:= UTF8ToUTF16 (s) ;
       Result:= UTF16ToUTF8(LeftStr(StringFull,StartPosition-1)+ UTF8ToUTF16(NewPattern)+ RightStr(StringFull, UTF8Length (StringFull)-StartPosition-UTF8Length(OldPattern)+1));
     end;
end;



Quote
Also you might have noticed my code already handles the rfReplaceAll flag.

Surely, I did. Maybe I'll spend some time to add the cycling.
« Last Edit: April 23, 2012, 08:59:11 am by paskal »
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #18 on: April 23, 2012, 01:11:26 pm »
Quote
Maybe I'll spend some time to add the cycling.

You could do so if you wish, but it might become very slow if you just keep cycling. The version I've done is single-pass. In other words, if there were a thousand replacements to do, it would still be one pass. No cycling is required, and it handles the double, triple and quad-byte UTF-8 sequences.

Anyway, it's totally up to you.
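
For illustration, the single-pass idea can be sketched roughly as below. This is not the UTF8StringReplace function discussed in this thread: ReplaceAllCI is a hypothetical helper, and it assumes that only ASCII letters need case folding, so the lowercased copy has exactly the same byte length as the original and match positions carry over directly. With full Unicode case folding that assumption breaks, because lowercasing can change a character's byte width, which is why per-character widths have to be tracked in the real thing.

```pascal
program SinglePassReplace;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (not from this thread): replaces every occurrence
// of OldPattern in S, ignoring ASCII case, in one left-to-right pass.
function ReplaceAllCI(const S, OldPattern, NewPattern: string): string;
var
  LowS, LowOld: string;
  i, OldLen: Integer;
begin
  // Guard against empty input, which would otherwise match nothing
  // useful (or loop forever on an empty pattern).
  if (S = '') or (OldPattern = '') then
    Exit(S);

  // SysUtils.LowerCase only folds A..Z, so byte offsets in LowS
  // are identical to byte offsets in S.
  LowS := LowerCase(S);
  LowOld := LowerCase(OldPattern);
  OldLen := Length(OldPattern);

  Result := '';
  i := 1;
  while i <= Length(S) do
  begin
    if Copy(LowS, i, OldLen) = LowOld then
    begin
      // Match: emit the replacement and skip past the matched bytes.
      Result := Result + NewPattern;
      Inc(i, OldLen);
    end
    else
    begin
      // No match: copy one byte of the original, preserving its case.
      Result := Result + S[i];
      Inc(i);
    end;
  end;
end;

begin
  WriteLn(ReplaceAllCI('abc ABC aBc', 'abc', 'x')); // prints: x x x
end.
```

The source is scanned once regardless of how many replacements are made. Appending to Result byte by byte keeps the sketch short; preallocating an output buffer would avoid the repeated reallocations.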

CM630

  • Hero Member
  • *****
  • Posts: 1641
  • I am not sure that I understand you.
    • http://sourceforge.net/u/cm630/profile/
Re: Case insensitive search and replace functions for strings.
« Reply #19 on: May 22, 2012, 10:29:50 am »
So far I have found one problem in KpjComp's UTF8StringReplace: it does not handle the case where the input string S is empty (zero length), which can lead to an exception, so I added a check for this.
Lazarus 4.4 32 bit (sometimes 64 bit); FPC 3.2.2

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #20 on: May 29, 2012, 10:37:22 am »
Thanks for the testing and feedback.

KpjComp

  • Hero Member
  • *****
  • Posts: 680
Re: Case insensitive search and replace functions for strings.
« Reply #21 on: May 29, 2012, 06:41:22 pm »
Quote
So far I have found one problem in KpjComp's UTF8StringReplace-

Ludo Brands spotted another one in the bug tracker.

If the source string contained multibyte characters, the function wasn't incrementing outpos properly. Ironically, I had accounted for this in the memory allocation.

Just change the CopyNextChar sub procedure to ->

Code:
  procedure CopyNextChar;
  begin
    NeedSize(outpos+src.charWidths[lpTextPos]);
    move(osrc.rawData[lpTextPos],result[outpos+1],osrc.charWidths[lpTextPos]);
    inc(outpos,osrc.charWidths[lpTextPos]);
    inc(lpTextPos);
  end;

 
