
Author Topic: TregExpr and unicode/cyrillic  (Read 673 times)

Jungle

  • New Member
  • *
  • Posts: 25
TregExpr and unicode/cyrillic
« on: December 13, 2023, 12:40:21 pm »
I want to change the case of Cyrillic text via regex. As I understand it, according to the info on this page, Lazarus uses Sorokin's TRegExpr, which supports Unicode categories such as "\p{L}".

When I use "\p{L}*" in my expression, I get the following error:
Quote
TRegExpr compile: incorrect {} braces (pos 4).

If I change "\p{L}*" to "\w*", the search fails. If I set WordChars to 'абвгдежзиклмнопрстуфхцчшщьыъэюя', the search succeeds but the case change fails.

So the following code works OK:

Code: Pascal
procedure TForm1.Button2Click(Sender: TObject);
var
  r: TRegExpr;
  s: string;

begin
  r := TRegExpr.Create();
  try
    s := 'hello';
    r.InputString := s;
    r.Expression  := '(\w*)';

    if ( r.Exec() ) then
    begin
      ShowMessage(r.Match[0]);
      // '\u$1' with substitution enabled is meant to upper-case the captured group
      s := r.Replace(s, '\u$1', true);
      ShowMessage(s);
    end;
  finally
    r.Free();
  end;
end;

The following doesn't work as expected.
Code: Pascal
procedure TForm1.Button3Click(Sender: TObject);
var
  r: TRegExpr;
  s: String;

begin
  r := TRegExpr.Create();
  try
    s := 'привет';
    r.InputString := s;
    r.Expression  := '(\w*)';
//    r.Expression  := '(\p{L}*)'; // TRegExpr compile: incorrect {} braces (pos 4)
    r.WordChars   := 'абвгдежзиклмнопрстуфхцчшщьыъэюя';

    if ( r.Exec ) then
    begin
      ShowMessage(r.Match[0]);
      s := r.Replace(s, '\u$1', true);
      ShowMessage(s);
    end;
  finally
    r.Free();
  end;
end;

What is the proper way?

Lazarus 2.2.6 x64 on Win11

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10647
  • Debugger - SynEdit - and more
    • wiki
Re: TregExpr and unicode/cyrillic
« Reply #1 on: December 13, 2023, 01:54:00 pm »
\p{} is afaik only supported if the TRegExpr was compiled to use WideChar.

There is a define in the unit <FPC>\packages\regexpr\src\regexpr.pas
Code: Pascal
{ off $DEFINE UnicodeRE} // Use WideChar for characters and UnicodeString/WideString for strings

Since it is part of the FPC distribution, the best way is to make a copy, change the define, and recompile.

You then also need to convert your strings to WideString, e.g. with "Utf8ToUtf16()".
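The conversion step mentioned above can be sketched roughly as follows. This is only a minimal illustration, not the TRegExpr solution itself: it uses the RTL routines UTF8Decode and UTF8Encode in place of Utf8ToUtf16() to avoid any LCL dependency, and it changes case with SysUtils.UnicodeUpperCase instead of a regex. The helper name UpperCaseUtf8 is made up for this example; a Unicode-enabled TRegExpr would work on the intermediate UnicodeString at the marked spot.

```pascal
program Utf8CaseRoundTrip;

{$mode objfpc}{$H+}
{$codepage utf8} // string literals in this source file are UTF-8

uses
  {$IFDEF UNIX}cwstring,{$ENDIF} // widestring manager, needed on Unix for non-ASCII case mapping
  SysUtils;

// Hypothetical helper (not part of TRegExpr or the RTL):
// UTF-8 in, UTF-16 processing in the middle, UTF-8 out.
function UpperCaseUtf8(const AText: UTF8String): UTF8String;
var
  Wide: UnicodeString;
begin
  // UTF-8 -> UTF-16: each Cyrillic letter becomes a single WideChar
  Wide := UTF8Decode(AText);

  // A TRegExpr built for WideChar would operate on Wide at this point.
  // Here the case change is done directly, without any regex.
  Wide := UnicodeUpperCase(Wide);

  // UTF-16 -> UTF-8 for the rest of the application
  Result := UTF8Encode(Wide);
end;

begin
  // Compare the round-tripped result with the expected upper-case text
  WriteLn(UpperCaseUtf8('привет') = 'ПРИВЕТ');
end.
```

The point of the round trip is that a byte-oriented engine sees each Cyrillic letter as two separate UTF-8 bytes, while after decoding every letter is one character that case mapping can recognise.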

Jungle

  • New Member
  • *
  • Posts: 25
Re: TregExpr and unicode/cyrillic
« Reply #2 on: December 13, 2023, 02:10:58 pm »
Quote
There is a define in the unit <FPC>\packages\regexpr\src\regexpr.pas
Code: Pascal
{ off $DEFINE UnicodeRE} // Use WideChar for characters and UnicodeString/WideString for strings

I can't find this define there, but there are a lot of
Code: Pascal
{$IFDEF UniCode}

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 10647
  • Debugger - SynEdit - and more
    • wiki
Re: TregExpr and unicode/cyrillic
« Reply #3 on: December 13, 2023, 02:28:51 pm »
It seems to have changed in FPC 3.3.1 (and in the original project).

I just saw there is a file "uregexp.pas" that may already provide the correct version (not sure, though).

If in doubt, the latest version is here: https://github.com/andgineer/TRegExpr

Thaddy

  • Hero Member
  • *****
  • Posts: 16300
  • Censorship about opinions does not belong here.
Re: TregExpr and unicode/cyrillic
« Reply #4 on: December 13, 2023, 02:34:35 pm »
Correct, but contrary to your previous remark, the string type is UnicodeString, not WideString.
If I smell bad code it usually is bad code and that includes my own code.

 
