Author Topic: How to search UTF8 characters with regular expressions  (Read 21796 times)

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
How to search UTF8 characters with regular expressions
« on: October 25, 2017, 04:30:40 am »
Hi, I wanted to use regular expressions to search for UTF8 characters, but I didn't succeed.

The results returned by the following code are not what I expected:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  Classes,
  RegExpr;

var
  Str, Pattern: string;
  expr: TRegExpr;
  Found: boolean;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';
  expr := TRegExpr.Create(Pattern);
  Found := expr.Exec(Str);
  while Found do begin
    // Print the start position of each match
    Write(expr.MatchPos[0], ' ');
    Found := expr.ExecNext;
  end;
  expr.Free;
end.

I need the result to be "3 6", but it is "1 2 4 7 8 9 10 13 16 17 18 19 20 22 23 25 28". What should I do? Can someone help me?
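
The numbers in that output look like byte offsets rather than character positions. A minimal sketch of why, assuming the source file is saved as UTF-8 and RegExpr is built without its Unicode define: each of the ten characters then occupies three bytes, and `[三六]` turns into a class of six single bytes. The program below does not use RegExpr; it only imitates a byte-wise class with a plain set. The names `Haystack`, `ClassBytes` and `MatchingOffsets` are made up for the illustration, and the UTF-8 bytes are written out by hand.

```pascal
program ByteClassDemo;

{$mode objfpc}{$H+}

const
  // UTF-8 bytes of '一二三四五六七八九十' (three bytes per character),
  // spelled out so the result does not depend on the source file encoding.
  Haystack: array[1..30] of Byte = (
    $E4, $B8, $80,  // 一
    $E4, $BA, $8C,  // 二
    $E4, $B8, $89,  // 三
    $E5, $9B, $9B,  // 四
    $E4, $BA, $94,  // 五
    $E5, $85, $AD,  // 六
    $E4, $B8, $83,  // 七
    $E5, $85, $AB,  // 八
    $E4, $B9, $9D,  // 九
    $E5, $8D, $81   // 十
  );
  // The six bytes that make up '三' (E4 B8 89) and '六' (E5 85 AD).
  ClassBytes: set of Byte = [$E4, $B8, $89, $E5, $85, $AD];

// Returns the 1-based offsets of all bytes that belong to the class,
// separated by single spaces.
function MatchingOffsets: string;
var
  i: Integer;
  Offset: string;
begin
  Result := '';
  for i := Low(Haystack) to High(Haystack) do
    if Haystack[i] in ClassBytes then
    begin
      Str(i, Offset);
      if Result <> '' then
        Result := Result + ' ';
      Result := Result + Offset;
    end;
end;

begin
  // Prints: 1 2 4 7 8 9 10 13 16 17 18 19 20 22 23 25 28
  WriteLn(MatchingOffsets);
end.
```

The list is the same as the one reported above, which suggests the engine matched individual bytes of the UTF-8 encoding instead of whole characters.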

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to search UTF8 characters with regular expressions
« Reply #1 on: October 25, 2017, 05:56:54 am »
Copy RegExpr to your project folder and define Unicode (search for the first "Unicode" in the RegExpr file and remove the dot before it). This makes RegExpr use WideString.

Another possibility is to use FLRE which uses UTF8:
Code: Pascal
var
  Str, Pattern: string;
  re: TFLRE;
  mc: TFLREMultiCaptures;
  i: Integer;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';

  re := TFLRE.Create(Pattern, [rfUTF8]);
  re.MatchAll(Str, mc, 1, 10);
  // One entry per match; capture 0 is the whole match
  for i := Low(mc) to High(mc) do
    WriteLn('Start: ', mc[i][0].Start, ', Length: ', mc[i][0].Length);

  re.Free;
  SetLength(mc, 0);
end.


tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #2 on: October 25, 2017, 08:00:06 am »
Quote from: engkin
Copy RegExpr to your project folder and Define Unicode (search for the first "Unicode" in RegExpr file and remove the dot before it). This makes RegExpr use WideString

Another possibility is to use FLRE which uses UTF8:

Thank you, engkin. I tried the first method, but it did not succeed; the results are the same as before. What did I do wrong?
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  Classes,
  RegExpr;  // I copied RegExpr and removed the dot in front of "Unicode"

var
  Str : WideString;  // I changed this into WideString
  Pattern: String;
  expr: TRegExpr;
  Found: boolean;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';
  expr := TRegExpr.Create(Pattern);
  Found := expr.Exec(Str);
  while Found do begin
    Write(expr.MatchPos[0], ' ');
    Found := expr.ExecNext;
  end;
  expr.Free;
end.

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #3 on: October 25, 2017, 09:52:56 am »
Quote from: engkin
Another possibility is to use FLRE which uses UTF8:

@engkin, I have tried FLRE, but I can't get any result. Did I make a mistake? This is my code:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  Classes,
  FLRE;

var
  Str : String;
  Pattern: String;
  re: TFLRE;
  cs: TFLRECaptures;
  Found: boolean;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';

  re := TFLRE.Create(Pattern, [rfUTF8]);
  Found := re.Match(Str, cs);
  while Found do begin
    writeln(Copy(Str, cs[0].start, cs[0].length));
    Found := re.MatchNext(Str, cs, cs[0].start + cs[0].length);
  end;

  re.Free;
  SetLength(cs, 0);
end.
« Last Edit: October 25, 2017, 10:15:35 am by tomitomy »

ASerge

  • Hero Member
  • *****
  • Posts: 2416
Re: How to search UTF8 characters with regular expressions
« Reply #4 on: October 25, 2017, 11:12:29 am »
Quote from: tomitomy
Thank you, engkin. I tried the first method, but it did not succeed; the results are the same as before. What did I do wrong?
Code: Pascal
{$APPTYPE CONSOLE}
program Project1;
{$mode objfpc}{$H+}
{$CODEPAGE UTF8}  // tells the compiler the source file is UTF-8

uses
  Classes,
  RegExpr; // with {$DEFINE Unicode}

var
  Str: UnicodeString;
  Pattern: UnicodeString;
  Expr: TRegExpr;
  Found: Boolean;
begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';
  Expr := TRegExpr.Create;
  Expr.Expression := Pattern;
  Found := Expr.Exec(Str);
  while Found do begin
    Write(Expr.MatchPos[0], ' ');
    Found := Expr.ExecNext;
  end;
  Expr.Free;
  Readln;  // keep the console window open
end.
On my Windows 7 machine this shows:
Quote
3 6

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #5 on: October 25, 2017, 01:51:06 pm »
Thank you, ASerge, your reply is very helpful. The key is {$CODEPAGE UTF8}. My problem is solved, thank you very much! :)

engkin

Re: How to search UTF8 characters with regular expressions
« Reply #6 on: October 26, 2017, 04:00:40 am »
Note that TRegExpr cannot handle all of Unicode correctly; it only works with UCS-2, so characters outside the Basic Multilingual Plane are not matched properly. For instance, the following string and pattern will not work right:
  Str := '一二三四五六七八😕九十';
  Pattern := '[三六😕]';

Also, you don't need {$CODEPAGE UTF8} if you add LazUTF8 from package LazUtils to your uses section.
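
The UCS-2 limitation can be seen from the encoding alone. In UTF-16, which is what WideString and UnicodeString hold, every code point above U+FFFF is stored as two code units (a surrogate pair). An engine that treats each code unit as one character would therefore see 😕 (U+1F615) as two separate items inside `[三六😕]`. Below is a minimal sketch of the standard surrogate arithmetic; `ToSurrogates` and `SurrogateHex` are hypothetical helpers written for this illustration, not part of RegExpr or LazUTF8.

```pascal
program SurrogateDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Splits a code point above U+FFFF into its UTF-16 surrogate pair.
// Hypothetical helper; assumes CodePoint is in the range $10000..$10FFFF.
procedure ToSurrogates(CodePoint: Cardinal; out HighUnit, LowUnit: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;                 // 20 significant bits remain
  HighUnit := Word($D800 or (V shr 10));   // top 10 bits
  LowUnit := Word($DC00 or (V and $3FF));  // bottom 10 bits
end;

// Formats the surrogate pair as two 4-digit hexadecimal code units.
function SurrogateHex(CodePoint: Cardinal): string;
var
  H, L: Word;
begin
  ToSurrogates(CodePoint, H, L);
  Result := IntToHex(LongInt(H), 4) + ' ' + IntToHex(LongInt(L), 4);
end;

begin
  WriteLn(SurrogateHex($1F615));  // prints D83D DE15
end.
```

The BMP characters 三 (U+4E09) and 六 (U+516D) each fit in a single code unit, which is why the earlier pattern works with the Unicode build while the emoji pattern does not.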

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #7 on: October 26, 2017, 10:48:41 am »
Quote from: engkin
Notice that TRegExpr will not be able to handle Unicode correctly, it only works with UCS-2. For instance, the following string and pattern will not work right:
  Str := '一二三四五六七八😕九十';
  Pattern := '[三六😕]';

Also, you don't need {$CODEPAGE UTF8} if you add LazUTF8 from package LazUtils to your uses section.

Oh, that's sad. I tested [三六?] and it does not work correctly. Is there a better way? I have tried TFLRE.Match and TFLRE.MatchNext, but they don't work either.
(I can't enter the face character in this post, so I replaced it with "?".)

wp

  • Hero Member
  • *****
  • Posts: 12857
Re: How to search UTF8 characters with regular expressions
« Reply #8 on: October 26, 2017, 11:19:05 am »
Quote from: tomitomy
Is there a better way?
While many people like regular expressions I don't - too cryptic...

Do it manually in good old Pascal using the UTF8 routines in LazUTF8:

Code: Pascal
uses
  LazUTF8;

procedure TForm1.Button1Click(Sender: TObject);
var
  Str, Pattern: string;
  p: PChar;
  msg: TStrings;   // Collects the results
  codepoint: Cardinal;
  ch: String;
  chlen: Integer;
  foundAt: Integer;
begin
  Str := '一二三四五六七八九十';
  Pattern := '三六a一2ä';

  msg := TStringList.Create;
  try
    p := PChar(Pattern);
    // Walk the pattern one UTF-8 codepoint at a time
    while p^ <> #0 do begin
      codepoint := UTF8CharacterToUnicode(p, chlen);
      ch := UnicodeToUTF8(codepoint);
      foundAt := UTF8Pos(ch, Str);
      if foundAt > 0 then
        msg.Add(ch + ' found at character position ' + IntToStr(foundAt))
      else
        msg.Add(ch + ' NOT found');
      inc(p, chlen);  // advance by the byte length of this codepoint
    end;
    ShowMessage(msg.Text);
  finally
    msg.Free;
  end;
end;

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #9 on: October 26, 2017, 11:53:22 am »
Quote from: wp
Do it manually in good old Pascal using the UTF8 routines in LazUTF8:

Thank you wp, but I want to implement the regular expression search function in my program. Thank you for your reply! :)

BeniBela

  • Hero Member
  • *****
  • Posts: 927
    • homepage
Re: How to search UTF8 characters with regular expressions
« Reply #10 on: October 26, 2017, 12:33:56 pm »
Quote from: tomitomy
@engkin, I have tried FLRE, but I can't get any result. Did I make a mistake? (Code as posted in Reply #3, which calls re.Match and re.MatchNext.)

You need to call re.UTF8MatchAll, not Match.

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #11 on: October 26, 2017, 01:20:32 pm »
Quote from: BeniBela
You need to call re.UTF8MatchAll not Match

Thank you, BeniBela. But there are many functions in FLRE; is MatchAll the only one that can be used for UTF8, and not the others?

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4571
  • I like bugs.
Re: How to search UTF8 characters with regular expressions
« Reply #12 on: October 26, 2017, 01:40:27 pm »
I agree that regular expressions are not needed for finding characters in a string. They are better for complex tests that would need a parser otherwise.

Quote from: wp
Do it manually in good old Pascal using the UTF8 routines in LazUTF8:
Code: Pascal
...
  codepoint := UTF8CharacterToUnicode(p, chlen);
  ch := UnicodeToUTF8(codepoint);
  foundAt := UTF8Pos(ch, Str);
...
The text is already encoded as UTF-8, so this does 2 useless conversions (UTF-8 to a codepoint and back again). This example shows a more efficient way:
 http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints
BTW, Pos() is often more useful than UTF8Pos(), even with Unicode text. For example, if you want to copy the text up to the found position, you need the byte offset. Pos() is also much faster than UTF8Pos().

A better and cleaner and more readable solution is to use an iterator defined in unit LazUnicode.
I just tested this with a Button and a Memo:
Code: Pascal
uses ... LazUTF8, LazUnicode;
...
procedure TForm1.Button1Click(Sender: TObject);
var
  ch, Str, Pattern: String;
  At: Integer;
begin
  Str := '一二三四五六七八九十';
  Pattern := '三六a一2ä';
  // With LazUnicode in uses, for-in yields one codepoint per iteration
  for ch in Pattern do begin
    At := Pos(ch, Str);
    if At > 0 then
      Memo1.Lines.Add(Format('%s found at byte position %d, character position %d.',
                             [ch, At, UTF8Pos(ch,Str)]));
  end;
end;
Actually, "character position" is the wrong term. These are codepoint positions, because the iterator uses TCodePointEnumerator by default.
There is also TUnicodeCharacterEnumerator, which takes care of combining accent codepoints.
Should I make it the default for the iterator? What problems could that cause?
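
The difference between a byte position and a codepoint position can be sketched without any library. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so the codepoint index of a byte offset is the number of non-continuation bytes up to and including that offset. The helper below is hypothetical, assumes well-formed UTF-8, and is not meant to describe how LazUTF8 does it; the sample bytes are written out by hand so the result does not depend on the source file encoding.

```pascal
program OffsetDemo;

{$mode objfpc}{$H+}

const
  // UTF-8 bytes of '一二三四五六' (three bytes per character).
  Sample: array[0..17] of Byte = (
    $E4, $B8, $80,  // 一
    $E4, $BA, $8C,  // 二
    $E4, $B8, $89,  // 三
    $E5, $9B, $9B,  // 四
    $E4, $BA, $94,  // 五
    $E5, $85, $AD   // 六
  );

// Converts a 1-based byte offset into a 1-based codepoint index by
// counting the bytes that start a codepoint (anything except 10xxxxxx).
// Hypothetical helper; offsets past the end are clamped to the length.
function ByteOffsetToCodepointIndex(const Bytes: array of Byte;
  ByteOffset: Integer): Integer;
var
  i: Integer;
begin
  Result := 0;
  if ByteOffset > Length(Bytes) then
    ByteOffset := Length(Bytes);
  for i := 0 to ByteOffset - 1 do
    if (Bytes[i] and $C0) <> $80 then
      Inc(Result);
end;

begin
  WriteLn(ByteOffsetToCodepointIndex(Sample, 7));   // 三 starts at byte 7, prints 3
  WriteLn(ByteOffsetToCodepointIndex(Sample, 16));  // 六 starts at byte 16, prints 6
end.
```

The loop has to scan the string from the start, which is one reason a byte offset from Pos() is cheaper to obtain than a codepoint position.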
« Last Edit: October 26, 2017, 01:43:41 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

tomitomy

Re: How to search UTF8 characters with regular expressions
« Reply #13 on: October 26, 2017, 03:02:02 pm »
I'm sorry, I didn't fully express my needs. I don't just want to find a string, I also want to replace strings. For example, I want to add a space between Chinese and English characters, tidy up the messy typography of an article, or extract the text I need from HTML code.

I have implemented a simple string search and replace function with "string.IndexOf", and now I want to add regular expression support to my program to meet more complex requirements.

I'm not familiar with the iterator in Lazarus. Can other people join the discussion?
« Last Edit: October 26, 2017, 03:09:43 pm by tomitomy »

JuhaManninen

Re: How to search UTF8 characters with regular expressions
« Reply #14 on: October 26, 2017, 05:08:57 pm »
Quote from: tomitomy
I'm not familiar with the iterator in Lazarus. Can other people join the discussion?

I don't know if anyone actively uses it yet. Please try it anyway.
Just add "LazUnicode" to your uses section and the "for ... in" syntax starts to work.