Lazarus

Free Pascal => Beginners => Topic started by: howisitpossible on January 13, 2021, 08:18:30 pm

Title: Help with regex
Post by: howisitpossible on January 13, 2021, 08:18:30 pm
I am trying to run my code, but it crashes on this line:
Code: Pascal
if (reg.Exec(txt)) then

The text file is an HTML page: https://en.wikipedia.org/wiki/D_(programming_language)

Code: Pascal
  1. program project1;
  2.  
  3. uses
  4.   RegExpr;
  5. var
  6.   txt, buf, filename:string;
  7.   fileHtml:Textfile;
  8.   reg:TRegExpr;
  9. begin
  10.   write('Name of html-file: ');
  11.   read(filename);
  12.  
  13.   AssignFile(fileHtml, filename);
  14.   Reset(fileHtml);
  15.  
  16.   txt:='';
  17.   while not Eof(fileHtml) do
  18.   begin
  19.     Readln(fileHtml, buf);
  20.     txt:=txt+buf;
  21.   end;
  22.  
  23.   reg:=TRegExpr.Create('(<img.+?>)(?![\s\S]*\1)');
  24.   if (reg.Exec(txt)) then
  25.   repeat
  26.     writeln(reg.Match[0]);
  27.   until not reg.ExecNext;
  28.  
  29. end.
  30.  
Title: Re: Help with regex
Post by: BlueIcaro on January 13, 2021, 08:59:19 pm
Hi, I'm not an expert in regular expressions, but if I hover the mouse over line 23, Lazarus shows a warning about an error in the expression.
See the attached image.
Title: Re: Help with regex
Post by: AlexTP on January 13, 2021, 09:12:41 pm
Did you see the error in the terminal?
Look:
Quote
[user:~]$ ./regtst
Name of html-file: tst.html
An unhandled exception occurred at $0000000000430B5A:
ERegExpr: TRegExpr compile: unrecognized modifier (pos 22)

This means that RegExpr reports an error in the regex itself.
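
To see that message without the program aborting, the exception can be caught. This is only a minimal sketch, assuming the RegExpr unit that ships with FPC, where ERegExpr is the exception class named in the output above; the program name is made up, and whether the pattern compiles depends on the TRegExpr version in use.

```pascal
program regexerror;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

var
  reg: TRegExpr;
begin
  reg := TRegExpr.Create;
  try
    try
      // Pattern from the first post; older TRegExpr versions reject the (?!...) part
      reg.Expression := '(<img.+?>)(?![\s\S]*\1)';
      // Force compilation now, so a bad pattern fails here and not later in Exec
      reg.Compile;
      WriteLn('Pattern compiled with this TRegExpr version.');
    except
      on E: ERegExpr do
        WriteLn('Regex error: ', E.Message);
    end;
  finally
    reg.Free;
  end;
end.
```

Calling Compile explicitly is a design choice here: it separates "the pattern is invalid" from "the pattern did not match", which are easy to confuse when the failure only shows up at Exec.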
Title: Re: Help with regex
Post by: howisitpossible on January 13, 2021, 09:14:07 pm
I can't understand what is wrong
Title: Re: Help with regex
Post by: howisitpossible on January 13, 2021, 09:16:19 pm
Quote
Hi, I'm not an expert in regular expressions, but if I hover the mouse over line 23, Lazarus shows a warning about an error in the expression.
See the attached image.

Hmm, I have no warnings.
Title: Re: Help with regex
Post by: BlueIcaro on January 13, 2021, 09:17:21 pm
Hi, see the attached picture. The debugger raises an exception: unrecognized modifier(pos2)

I'm testing on Ubuntu 20.04 x64 with Lazarus 2.10, using this code:
Code:
program project1;

{$mode objfpc}{$H+}

uses {$IFDEF UNIX} {$IFDEF UseCThreads}
  cthreads, {$ENDIF} {$ENDIF}
  Classes { you can add units after this },
  RegExpr;

var
  txt, buf, filename: string;
  fileHtml: Textfile;
  reg: TRegExpr;
begin
  //write('Name of html-file: ');
  //read(filename);

  //AssignFile(fileHtml, filename);
  //Reset(fileHtml);

  //txt:='';
  //while not Eof(fileHtml) do
  //begin
  //  Readln(fileHtml, buf);
  //  txt:=txt+buf;
  //end;

  txt := '<html>  <head>     <title>Esta es mi primera pagina</title> </head> <body>    <h1>Esto es un encabezado</h1>    <p>Y esto es un parrafo, donde podemos escribir todo el rollo que se nos ocurra.</body></html>';

  reg := TRegExpr.Create('(<img.+?>)(?![\s\S]*\1)');



  if (reg.Exec(txt)) then
    repeat
      writeln(reg.Match[0]);
    until not reg.ExecNext;

end. 

/BlueIcaro
Title: Re: Help with regex
Post by: AlexTP on January 13, 2021, 09:17:30 pm
The regex uses an assertion (the negative lookahead (?!...)), which is supported only in the latest TRegExpr from https://github.com/andgineer/TRegExpr
Title: Re: Help with regex
Post by: howisitpossible on January 13, 2021, 09:19:02 pm
Can you help me to correct this regex?
Title: Re: Help with regex
Post by: BlueIcaro on January 13, 2021, 09:21:46 pm
Quote
Can you help me to correct this regex?

Here is some info:
https://regex.sorokin.engineer/en/latest/tregexpr.html

/BlueIcaro
Title: Re: Help with regex
Post by: paweld on January 13, 2021, 09:28:52 pm
Code: Pascal
program project1;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

var
  filename: string;
  sl: TStringList;
  reg: TRegExpr;

begin
  Write('Name of html-file: ');
  ReadLn(filename);
  sl := TStringList.Create;
  // Non-greedy match of each <img ...> tag; no lookahead assertion needed
  reg := TRegExpr.Create('<img.*?>');
  try
    sl.LoadFromFile(filename);
    if reg.Exec(sl.Text) then
      repeat
        WriteLn(reg.Match[0]);
      until not reg.ExecNext;
  finally
    // Release both objects even if loading or matching raises an exception
    reg.Free;
    sl.Free;
  end;
end.
Title: Re: Help with regex
Post by: howisitpossible on January 13, 2021, 09:35:05 pm
Does it find the tags without repetitions?
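
A plain pattern such as <img.*?> reports every occurrence, so repeated tags have to be filtered outside the regex. One possible way to do that is sketched below; it is not code from this thread. The sample string and program name are made up, and note that it keeps the first occurrence of each tag, whereas the lookahead in the original regex would keep the last one.

```pascal
program uniqueimgs;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

const
  // Hypothetical sample input; the first tag appears twice
  SampleHtml = '<p><img src="a.png"> text <img src="b.png"> more <img src="a.png"></p>';

var
  reg: TRegExpr;
  seen: TStringList;
  i: Integer;
begin
  seen := TStringList.Create;
  reg := TRegExpr.Create('<img[^>]*>');
  try
    // Compare tags exactly; TStringList ignores case by default
    seen.CaseSensitive := True;
    if reg.Exec(SampleHtml) then
      repeat
        // IndexOf is a linear search, which is fine for one page of tags
        if seen.IndexOf(reg.Match[0]) < 0 then
          seen.Add(reg.Match[0]);
      until not reg.ExecNext;
    // Print each distinct tag once, in order of first appearance
    for i := 0 to seen.Count - 1 do
      WriteLn(seen[i]);
  finally
    reg.Free;
    seen.Free;
  end;
end.
```

Doing the filtering in a list also avoids the lookahead entirely, so it does not depend on which TRegExpr version is installed.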
Title: Re: Help with regex
Post by: sstvmaster on January 13, 2021, 10:37:28 pm
You can use this RegEx to match <img ... /> tags: <img.*?src="(.*?)"[^\>]+>

This RegEx needs too many steps: (<img.+?>)(?![\s\S]*\1)

reg.Match[0] returns the whole <img ... /> tag.
reg.Match[1] returns the image URL only.

Code: Pascal
program project1;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr, SysUtils;

var
  i: Integer;
  sl: TStringList;
  reg: TRegExpr;
  sum: String = 'Found %d matches in ~%dms.';
  cstart, elapsed: Int64;
begin
  cstart := GetTickCount64; // timing includes loading the file
  i := 0;
  sl := TStringList.Create;
  reg := TRegExpr.Create;
  try
    sl.LoadFromFile('wikipedia.txt');

    reg.Expression := '<img.*?src="(.*?)"[^\>]+>';

    if reg.Exec(sl.Text) then
      repeat
        WriteLn(reg.Match[0]); // whole <img> tag
        //WriteLn(reg.Match[1]); // image url only
        Inc(i);
      until not reg.ExecNext;
    elapsed := GetTickCount64 - cstart;
  finally
    // Release both objects even if loading or matching fails
    reg.Free;
    sl.Free;
  end;

  WriteLn;
  WriteLn(Format(sum, [i, elapsed]));

  ReadLn; // keep the console window open
end.
Title: Re: Help with regex
Post by: sstvmaster on January 13, 2021, 11:30:05 pm
@AlexTP

This RegEx does not work: <img[^>]+src=(?:"|'')\K(.[^">]+?)(?="|'')

I am using your latest TRegExpr!
Title: Re: Help with regex
Post by: AlexTP on January 14, 2021, 12:58:02 am
Just read the docs at Sorokin's site. \K is NOT supported, and the regex error tells you that.
Title: Re: Help with regex
Post by: trev on January 14, 2021, 01:52:13 am
Code: Pascal
reg := TRegExpr.Create('(<img[^>]+)(src=["\''])([^"\'']+)');
if (reg.Exec(txt)) then
  repeat
    writeln('0: ', reg.Match[0]);
    writeln('1: ', reg.Match[1]);
    writeln('2: ', reg.Match[2]);
    writeln('3: ', reg.Match[3]);
  until not reg.ExecNext;

produces:

Quote
0: <img src="/static/images/footer/poweredby_mediawiki_88x31.png
1: <img
2: src="
3: /static/images/footer/poweredby_mediawiki_88x31.png

To explain what is happening (as I understand it):

* the parentheses group parts of the expression so that each part can be retrieved separately
* Match[0] is the whole match: everything from <img up to the end of the image URL, not including the closing quote
* Match[1] is the first group, (<img[^>]+): <img followed by one or more characters that are not > (the ^ in [^>] means "any character except >" and the + means "one or more"); it then gives back characters until src= can match, so here it is <img plus the space before src
* Match[2] is the second group: src=" or src='
* Match[3] is the third group: everything after src=" or src=' up to, but not including, the next " or ' (the image URL in this case, so reg.Match[3] is what you want to output in the writeln).
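
To try the grouping without downloading the Wikipedia page, the same pattern can be run against a short inline string. This is only a sketch: the sample tag and program name are made up, and it assumes the SubExprMatchCount property of TRegExpr, which reports how many groups took part in the current match.

```pascal
program imggroups;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical sample input, not taken from the Wikipedia page
  SampleHtml = '<img alt="logo" src="/static/logo.png" width="88">';

var
  reg: TRegExpr;
  i: Integer;
begin
  // Same pattern as above; '' is an escaped single quote inside a Pascal string
  reg := TRegExpr.Create('(<img[^>]+)(src=["\''])([^"\'']+)');
  try
    if reg.Exec(SampleHtml) then
      repeat
        // Match[0] is the whole match, Match[1..SubExprMatchCount] are the groups
        for i := 0 to reg.SubExprMatchCount do
          WriteLn(i, ': ', reg.Match[i]);
      until not reg.ExecNext;
  finally
    reg.Free;
  end;
end.
```

With an attribute placed before src, as in this sample, Match[1] grows to include it, which shows that the first group is not always just <img.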