
Author Topic: How to search UTF8 characters with regular expressions  (Read 21903 times)

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #15 on: October 26, 2017, 05:44:12 pm »
I don't know if anyone actively uses it yet. Please try it anyway.
Just use "LazUnicode" and the "for ... in" syntax starts to work.

Thank you, I will try to use "LazUnicode" and the "for ... in" syntax in the future.

BeniBela

  • Hero Member
  • *****
  • Posts: 928
    • homepage
Re: How to search UTF8 characters with regular expressions
« Reply #16 on: October 26, 2017, 07:08:55 pm »
You need to call re.UTF8MatchAll not Match

Thank you BeniBela, but there are many functions in FLRE. Can only MatchAll be used for UTF8, and not the others?

Every function whose name starts with UTF8 can be used with UTF8.

Do not use (utf8) Match. I was confused, too.

There is also UTF8Find.



Code: Pascal
  ch, Str, Pattern: String;

A string loop variable? Creating a new string for every character?

That is a horrible idea

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4608
  • I like bugs.
Re: How to search UTF8 characters with regular expressions
« Reply #17 on: October 26, 2017, 07:13:59 pm »
A string loop variable? Creating a new string for every character?
That is a horrible idea
No, it is a very good idea!
We are dealing with variable width encodings here. Then one codepoint or "character" can occupy more than one codeunit. String is the most natural way to store it. This applies to both UTF-8 and UTF-16.
How would you do it?
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

BeniBela

  • Hero Member
  • *****
  • Posts: 928
    • homepage
Re: How to search UTF8 characters with regular expressions
« Reply #18 on: October 26, 2017, 07:19:15 pm »

Codepoint by codepoint in an integer

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4608
  • I like bugs.
Re: How to search UTF8 characters with regular expressions
« Reply #19 on: October 26, 2017, 07:27:21 pm »
Codepoint by codepoint in an integer
Uhhh, so clumsy! You need back and forth conversions. You cannot compare against literals like:
Code: Pascal
  if ch = '五' then
    ...
and so on. Besides, how do you search for an integer codepoint from a string using Pos() or similar function?
The type "String" is so nice because it behaves like an atomic type. The underlying array nature is hidden in this use case.
« Last Edit: October 26, 2017, 07:38:15 pm by JuhaManninen »

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to search UTF8 characters with regular expressions
« Reply #20 on: October 26, 2017, 08:02:17 pm »
every function whose name starts with UTF8 can be used with UTF8

The source code shows no difference between the functions that start with UTF8 and the ones without it. What really matters is using the rfUTF8 flag (or the u modifier).

wp

  • Hero Member
  • *****
  • Posts: 12926
Re: How to search UTF8 characters with regular expressions
« Reply #21 on: October 27, 2017, 12:37:46 am »
The text is already encoded as UTF-8. It means you do 2 useless conversions. This example shows a more efficient way:
 http://wiki.freepascal.org/UTF8_strings_and_characters#Iterating_over_string_analysing_individual_codepoints
BTW, Pos() is often more useful than UTF8Pos(), even with Unicode. For example, if you want to copy the text up to the found position, you need the byte offset. Pos() is also much faster than UTF8Pos().
Thanks for clarifying this. I had tried it first, it did not work, I tried the other solution, it worked, and I posted it without thinking...
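
The byte-offset point quoted above can be illustrated with a minimal sketch. It assumes there is no {$codepage} directive, so the #$.. constants are stored as raw bytes; the program and variable names are made up for the example, and only Pos() and Copy() from the System unit are used.

```pascal
program PosByteOffset;

{$mode objfpc}{$H+}

var
  S, Needle, Prefix: RawByteString;
  P: SizeInt;

begin
  // UTF-8 bytes of '一二三' and '三', spelled out so the example does not
  // depend on the encoding of the source file.
  S := #$E4#$B8#$80#$E4#$BA#$8C#$E4#$B8#$89;
  Needle := #$E4#$B8#$89;

  // Pos() works on bytes: it returns the 1-based byte index of the match.
  P := Pos(Needle, S);

  // Copy() also counts in bytes, so the result of Pos() can be used directly.
  Prefix := Copy(S, 1, P - 1);

  WriteLn(P);               // 7: two 3-byte codepoints precede the match
  WriteLn(Length(Prefix));  // 6: the bytes of '一二'
end.
```

UTF8Pos() would instead report the position counted in codepoints, which cannot be passed to Copy() without converting it back to a byte offset.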

A better and cleaner and more readable solution is to use an iterator defined in unit LazUnicode.
I must confess that I have never used this unit. What confused me is the "Unicode" in its name, which makes it look like a helper unit for Delphi's kind of Unicode (UTF-16), but it is actually for UTF-8. Why didn't you put these routines into LazUTF8? I know that the term "Unicode" covers UTF-8 as well as UTF-16 and UTF-32, but when people talk about "unicode" they usually mean Delphi's UTF-16 only. Damn confusing...

BeniBela

  • Hero Member
  • *****
  • Posts: 928
    • homepage
Re: How to search UTF8 characters with regular expressions
« Reply #22 on: October 27, 2017, 01:10:35 am »
Codepoint by codepoint in an integer
Uhhh, so clumsy! You need back and forth conversions. You cannot compare against literals like:
Code: Pascal
  if ch = '五' then
    ...
and so on. Besides, how do you search for an integer codepoint from a string using Pos() or similar function?
The type "String" is so nice because it behaves like an atomic type. The underlying array nature is hidden in this use case.

Then it becomes a codepoint comparison ch = $4e94. It is faster than a string comparison, too.

If you iterate over Str and compare the codepoints, it would not need Pos.

Alternatively the iterator could use pchar + length. Then it does not need to allocate a new string and can be used with a pos for pchars. I try to replace all temporary string usages with pchar + length.


FLRE does the same. That is one reason why it returns the matches as offset + length rather than as a string.
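
A minimal sketch of the "codepoint by codepoint in an integer" approach, walking the string with a PChar and a length and never allocating a temporary string. DecodeCodepoint is a hypothetical helper written for this example (it is not part of LazUTF8 or FLRE), and it assumes well-formed UTF-8; invalid or truncated sequences are simply passed through one byte at a time.

```pascal
program CodepointLoop;

{$mode objfpc}{$H+}

// Hypothetical helper: decodes the UTF-8 sequence starting at P, with
// Remaining bytes available. Returns the codepoint and stores the number
// of bytes consumed in Len.
function DecodeCodepoint(P: PChar; Remaining: SizeInt; out Len: Integer): Cardinal;
var
  B: Byte;
  i: Integer;
begin
  B := Ord(P^);
  Result := B;
  if B < $80 then
    Len := 1                        // ASCII
  else if (B and $E0) = $C0 then
  begin
    Len := 2;
    Result := B and $1F;
  end
  else if (B and $F0) = $E0 then
  begin
    Len := 3;
    Result := B and $0F;
  end
  else if (B and $F8) = $F0 then
  begin
    Len := 4;
    Result := B and $07;
  end
  else
    Len := 1;                       // stray continuation byte, pass through

  if Len > Remaining then
  begin
    // Truncated sequence at the end of the buffer: return the raw byte.
    Len := 1;
    Result := B;
    Exit;
  end;

  // Fold in the low 6 bits of each continuation byte.
  for i := 1 to Len - 1 do
    Result := (Result shl 6) or (Ord(P[i]) and $3F);
end;

var
  Str: RawByteString;
  P, PEnd: PChar;
  Len, Hits: Integer;
  CP: Cardinal;

begin
  // UTF-8 bytes of '四五六', spelled out so the example does not depend
  // on the encoding of the source file.
  Str := #$E5#$9B#$9B#$E4#$BA#$94#$E5#$85#$AD;

  P := PChar(Str);
  PEnd := P + Length(Str);
  Hits := 0;
  while P < PEnd do
  begin
    CP := DecodeCodepoint(P, PEnd - P, Len);
    if CP = $4E94 then              // integer comparison against U+4E94 ('五')
      Inc(Hits);
    Inc(P, Len);
  end;

  WriteLn(Hits);                    // 1
end.
```

The trade-off discussed in the thread is visible here: the loop does no allocation and compares plain integers, but the literal '五' has to be written as $4E94.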
« Last Edit: October 27, 2017, 01:13:26 am by BeniBela »

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #23 on: October 27, 2017, 03:42:58 am »
every function whose name starts with UTF8 can be used with UTF8

Do not use (utf8) Match. I was confused, too

There is also UTF8Find

The source code shows no difference between functions that start with UTF8 and the ones without it. What really matters is using rfUTF8 flag (or modifier u).

Thank you BeniBela and engkin. MatchAll can fulfill all of my needs. Thank you very much.

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #24 on: October 27, 2017, 08:07:26 am »
There seems to be a bug in FLRE. Running this code raises an exception:

Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  Classes,
  FLRE;

var
  Str, Pattern: string;
  re: TFLRE;
  i: Integer;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';

  re := TFLRE.Create(Pattern,[rfUTF8]);
  writeLn(re.Replace(Str, ''));  // Delete the Matched String
  re.Free;
end.


ASerge

  • Hero Member
  • *****
  • Posts: 2444
Re: How to search UTF8 characters with regular expressions
« Reply #25 on: October 27, 2017, 05:26:41 pm »
Also, you don't need {$CODEPAGE UTF8} if you add LazUTF8 from package LazUtils to your uses section.
Are you sure? On my machine this code does NOT work:
Code: Pascal
{$APPTYPE CONSOLE}
program Project1;
{$mode objfpc}{$H+}
{-$CODEPAGE UTF8}

uses
  Classes, LazUTF8,
  RegExpr; // with {$DEFINE Unicode}

var
  Str: UnicodeString;
  Pattern: UnicodeString;
  Expr: TRegExpr;
  Found: Boolean;
begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';
  Expr := TRegExpr.Create;
  Expr.Expression := Pattern;
  Found := Expr.Exec(Str);
  while Found do begin
    Write(Expr.MatchPos[0], ' ');
    Found := Expr.ExecNext;
  end;
  Expr.Free;
  Readln;
end.

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to search UTF8 characters with regular expressions
« Reply #26 on: October 27, 2017, 06:32:13 pm »
Also, you don't need {$CODEPAGE UTF8} if you add LazUTF8 from package LazUtils to your uses section.
Are you sure? On my machine this code does NOT work:
[..]

Yes! You need to use String instead of UnicodeString:
Code: Pascal
uses
  LazUTF8...

  Str: String;
  Pattern: String;
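
To make the String versus UnicodeString distinction concrete, here is a minimal sketch that uses only the System unit. It assumes no {$codepage} directive, so the #$.. constants are kept as raw bytes, and it converts explicitly with UTF8Decode rather than relying on any implicit literal conversion. The same three characters occupy 9 code units as UTF-8 bytes and 3 code units as UTF-16.

```pascal
program StringKinds;

{$mode objfpc}{$H+}

var
  U8: RawByteString;   // byte string holding UTF-8
  U16: UnicodeString;  // UTF-16 string

begin
  // UTF-8 bytes of '一二三', spelled out so the example does not depend
  // on the encoding of the source file.
  U8 := #$E4#$B8#$80#$E4#$BA#$8C#$E4#$B8#$89;

  // Explicit UTF-8 to UTF-16 conversion.
  U16 := UTF8Decode(U8);

  WriteLn(Length(U8));           // 9: Length counts bytes
  WriteLn(Length(U16));          // 3: Length counts UTF-16 code units
  WriteLn(Ord(U16[3]) = $4E09);  // TRUE: the third code unit is U+4E09 ('三')
end.
```

This only shows how the two types store the same text. Which type a given literal ends up in correctly depends on the source file encoding and on the {$codepage} setting, which is the part the posts above disagree about.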

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: How to search UTF8 characters with regular expressions
« Reply #27 on: October 27, 2017, 07:17:10 pm »
There seems to be a BUG in FLRE, Running this code will generate an exception:

[..]

I did not get any exception. What exception did you get?

How about this code:
Code: Pascal
program Project1;

{$mode objfpc}{$H+}

uses
  Classes, LazUTF8,
  FLRE, Windows;

var
  Str, Pattern, newStr: string;
  re: TFLRE;

begin
  Str := '一二三四五六七八九十';
  Pattern := '[三六]';

  re := TFLRE.Create(Pattern,[rfUTF8]);
  newStr := re.Replace(Str, '');  // Delete the Matched String
  MessageBoxW(0,@UnicodeString(newStr)[1], 'Test FLRE Replace', 0);
  re.Free;

  ReadLn;
end.

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #28 on: October 28, 2017, 07:30:45 am »
I did not get any exception. What exception did you get?

I'm on Linux, so I can't use the "Windows" unit. The exception dialog is in the attachment.

tomitomy

  • Sr. Member
  • ****
  • Posts: 251
Re: How to search UTF8 characters with regular expressions
« Reply #29 on: October 28, 2017, 07:43:49 am »
Yes! You need to use String instead of UnicodeString:

I am confused. I do not know what the difference is between String, WideString and UnicodeString, and I do not know what "{$CODEPAGE UTF8}", "uses LazUTF8" and "uses LazUnicode" change. Why is it so complicated?

 
