Author Topic: TRegExpr: correct hit count, but only one .match[]?  (Read 225 times)

sierdolg

  • New Member
  • *
  • Posts: 22
TRegExpr: correct hit count, but only one .match[]?
« on: November 22, 2019, 03:15:28 pm »
Summary: I am trying to get TRegExpr to work. An .Exec/.ExecNext loop finds the correct number of hits, but after the first hit .Match[] returns neither the matched strings nor their start and length.

Details: Reduced to a minimal example: I have a RichMemo displaying a large amount of text, in which all dates (in German notation "[D]D.[M]M.YYYY") should be highlighted in color. Since the text contains German umlauts, it is Unicode, and under Linux it is encoded in UTF-8. Because TRegExpr seems to return the byte position and not the character position ("{off $DEFINE UniCode}" or "{$DEFINE UniCode}" made no difference), I couldn't think of anything better than to take the matched string the regex yields and then use UTF8Pos to find its character position.

So I thought the following would do the job:

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  e: TRegExpr;
  i: Integer;
  MatchPosU, LastMatchPosU, MatchLenU: Integer; // Start and Length of Match in characters (not bytes)
const
  RE_DATUM = '[0123]?[0-9]\.[01]?[0-9]\.[0-9]{4}'; // just a simple matching demo, no real date validation needed!
  INI_HIGHLIGHT_COLOR = '#FF00FF';
begin
  e := TRegExpr.Create(RE_DATUM);

  i := 0;
  if e.Exec(RichMemo1.Text) then
  repeat
    MatchPosU:=UTF8Pos(e.Match[i], RichMemo1.Text, lastMatchPosU+1); //search only in residual part of text
    MatchLenU:=UTF8Length(e.Match[i]);
    WriteLn(i, ': "', e.Match[i], '" at "', MatchPosU, ' for ', MatchLenU, ' characters');
    RichMemo1.SetRangeColor(e.MatchPos[i]-1,
                            e.MatchLen[i],
                            StringToColor(INI_HIGHLIGHT_COLOR));
    LastMatchPosU := MatchPosU;
    i := i + 1 ;
  until not e.ExecNext;
  e.Free;
end;

That works, but only for the first match. Looking at STDOUT,
Code:
0: "9.5.2012" at "0 for 8 characters
1: "" at "0 for 0 characters
2: "" at "0 for 0 characters
3: "" at "0 for 0 characters
4: "" at "0 for 0 characters
5: "" at "0 for 0 characters
6: "" at "0 for 0 characters
7: "" at "0 for 0 characters
8: "" at "0 for 0 characters
9: "" at "0 for 0 characters
10: "" at "0 for 0 characters
11: "" at "0 for 0 characters

The remaining matches are apparently found by the regex object, but ".Match[i]" always returns an empty string.

What am I doing wrong, or what could I have overlooked?

Contextual information: Lazarus 2.0.7 r62276M, FPC 3.0.4, x86_64-linux-gtk2 on Linux. Programming experience: quite a rookie ;-). Minimal sample project attached.
« Last Edit: November 25, 2019, 11:26:45 am by sierdolg »

Roland57

  • Newbie
  • Posts: 4
Re: TRegExp: correct hit count, but no
« Reply #1 on: November 22, 2019, 04:56:00 pm »
Hello!

I have not tested it, but I believe you should replace the index i with 0 everywhere (e.Match[0] and so on).

Thaddy

  • Hero Member
  • *****
  • Posts: 9285
Re: TRegExp: correct hit count, but no
« Reply #2 on: November 22, 2019, 05:02:16 pm »
Yes, e.Match[0] everywhere. The higher indices are for subexpressions (capture groups), which you don't use.
Also, you need not keep the position and length yourself: the regex engine should update them for you on each ExecNext.
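
To illustrate the point about indices, here is a minimal console sketch. It assumes a hypothetical pattern with three capture groups (the pattern in the question has none) and a made-up input string. Match[0] is the whole current match and changes with every ExecNext; Match[1] up to Match[SubExprMatchCount] are the groups of that same match.

```pascal
program MatchIndexDemo;
{$mode objfpc}{$H+}
uses
  RegExpr;
var
  re: TRegExpr;
  k: Integer;
begin
  // Hypothetical pattern for illustration: groups for day, month and year.
  re := TRegExpr.Create('(\d{1,2})\.(\d{1,2})\.(\d{4})');
  try
    if re.Exec('Due 9.5.2012, renewed 24.12.2013.') then
      repeat
        // Index 0 always refers to the whole current match.
        WriteLn('whole match: ', re.Match[0]);
        // Indices 1..SubExprMatchCount refer to the capture groups
        // of the current match, not to later matches in the text.
        for k := 1 to re.SubExprMatchCount do
          WriteLn('  group ', k, ': ', re.Match[k]);
      until not re.ExecNext;
  finally
    re.Free;
  end;
end.
```

With a pattern that has no groups, as in the question, only index 0 is ever filled, which would explain the empty strings for the higher indices.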
« Last Edit: November 22, 2019, 05:04:40 pm by Thaddy »

sierdolg

  • New Member
  • *
  • Posts: 22
Re: TRegExpr: correct hit count, but no match?
« Reply #3 on: November 25, 2019, 11:26:01 am »
Thanks a lot!! That was exactly my misinterpretation. How could I be so blind :o)

The following procedure now does exactly what it should (and I also corrected the mutilated subject):

Code: Pascal
procedure TForm1.Button1Click(Sender: TObject);
var
  e: TRegExpr;
  i: Integer;
  MatchPosU, LastMatchPosU, MatchLenU: Integer; // Start and Length of Match in characters (not bytes)
const
  RE_DATUM = '[0123]?[0-9]\.[01]?[0-9]\.[0-9]{4}'; // just a simple matching demo, no real date validation needed!
  INI_HIGHLIGHT_COLOR = 'clRed';
begin
  e := TRegExpr.Create(RE_DATUM);
  try
    i := 0;
    LastMatchPosU := 0;
    if e.Exec(RichMemo1.Text) then
    repeat
      // Search only in the remaining part of the text for the character position
      // of the match, as e.MatchPos[] seems to return a byte position
      MatchPosU := UTF8Pos(e.Match[0], RichMemo1.Text, LastMatchPosU+1);
      // MatchLenU := UTF8Length(e.Match[0]); unnecessary here: the date matches are
      // plain ASCII, so the byte count in e.MatchLen[0] equals the character count
      WriteLn(i, ': "', e.Match[0], '" at "', e.MatchPos[0], ' for ',
              e.MatchLen[0], ' bytes - MatchPosU=', MatchPosU);
      RichMemo1.SetRangeColor(MatchPosU-1,
                              e.MatchLen[0],
                              StringToColor(INI_HIGHLIGHT_COLOR));
      LastMatchPosU := MatchPosU;
      i := i + 1;
    until not e.ExecNext;
  finally
    e.Free; // release the regex object even if an exception occurs
  end;
end;
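
As a possible alternative to re-searching the text with UTF8Pos, the byte offset from MatchPos[0] could be converted directly into a character offset. This is only a sketch under assumptions: ByteToCharIndex is a hypothetical helper, the sample text is made up, the RegExpr unit is assumed to be built without UniCode (so positions are byte offsets), and the source file is assumed to be saved as UTF-8.

```pascal
program ByteToCharDemo;
{$mode objfpc}{$H+}
uses
  RegExpr, LazUTF8;

// Hypothetical helper: converts a 1-based byte index into a 1-based
// character (codepoint) index for a UTF-8 encoded string.
function ByteToCharIndex(const S: string; BytePos: Integer): Integer;
begin
  // Count the codepoints in the bytes before BytePos, then add one.
  Result := UTF8Length(PChar(S), BytePos - 1) + 1;
end;

const
  // Made-up sample text; the umlauts make byte and character offsets differ.
  SAMPLE = 'Müller: 9.5.2012, Köln: 24.12.2013';
var
  re: TRegExpr;
begin
  re := TRegExpr.Create('[0123]?[0-9]\.[01]?[0-9]\.[0-9]{4}');
  try
    if re.Exec(SAMPLE) then
      repeat
        // MatchPos[0] is the byte offset reported by the regex engine;
        // the helper turns it into a character offset.
        WriteLn(re.Match[0], ': byte ', re.MatchPos[0],
                ', char ', ByteToCharIndex(SAMPLE, re.MatchPos[0]));
      until not re.ExecNext;
  finally
    re.Free;
  end;
end.
```

In the button handler, the converted value minus one would take the place of MatchPosU-1 in the SetRangeColor call, and LastMatchPosU would no longer be needed.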