Topic: Help with regex

howisitpossible

  • Newbie
  • Posts: 5
Help with regex
« on: January 13, 2021, 08:18:30 pm »
I'm trying to run my code, but it crashes on this line:
Code: Pascal
if (reg.Exec(txt)) then

The text file is an HTML page: https://en.wikipedia.org/wiki/D_(programming_language)

Code: Pascal
  1. program project1;
  2.  
  3. uses
  4.   RegExpr;
  5. var
  6.   txt, buf, filename:string;
  7.   fileHtml:Textfile;
  8.   reg:TRegExpr;
  9. begin
  10.   write('Name of html-file: ');
  11.   read(filename);
  12.  
  13.   AssignFile(fileHtml, filename);
  14.   Reset(fileHtml);
  15.  
  16.   txt:='';
  17.   while not Eof(fileHtml) do
  18.   begin
  19.     Readln(fileHtml, buf);
  20.     txt:=txt+buf;
  21.   end;
  22.  
  23.   reg:=TRegExpr.Create('(<img.+?>)(?![\s\S]*\1)');
  24.   if (reg.Exec(txt)) then
  25.   repeat
  26.     writeln(reg.Match[0]);
  27.   until not reg.ExecNext;
  28.  
  29. end.
  30.  
« Last Edit: January 13, 2021, 08:20:33 pm by howisitpossible »

BlueIcaro

  • Hero Member
  • *****
  • Posts: 791
    • Blog personal
Re: Help with regex
« Reply #1 on: January 13, 2021, 08:59:19 pm »
Hi, I'm not an expert in regular expressions, but if I hover the mouse over line 23, Lazarus gives a warning about an error in the expression.
See the attached image.

AlexTP

  • Hero Member
  • *****
  • Posts: 2365
    • UVviewsoft
Re: Help with regex
« Reply #2 on: January 13, 2021, 09:12:41 pm »
Did you see the error in the terminal? Look:
Quote
[user:~]$ ./regtst
Name of html-file: tst.html
An unhandled exception occurred at $0000000000430B5A:
ERegExpr: TRegExpr compile: unrecognized modifier (pos 22)

It means that RegExpr raises an error while compiling this regex.
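
The unhandled exception can be turned into a readable message by catching it. This is only a minimal sketch, assuming the FPC RegExpr unit, where the exception class is ERegExpr (the name shown in the terminal output above). The sample input string is made up for illustration.

```pascal
program regexerr;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

var
  reg: TRegExpr;
begin
  reg := TRegExpr.Create;
  try
    try
      // The expression from the first post; older TRegExpr versions
      // are reported in this thread to reject the (?!...) part.
      reg.Expression := '(<img.+?>)(?![\s\S]*\1)';
      // Hypothetical sample input, not the Wikipedia page.
      if reg.Exec('<img src="a.png">') then
        WriteLn(reg.Match[0])
      else
        WriteLn('No match.');
    except
      // ERegExpr is raised when the expression cannot be compiled.
      on E: ERegExpr do
        WriteLn('Regex error: ', E.Message);
    end;
  finally
    reg.Free;
  end;
end.
```

Whether the except branch runs depends on the installed TRegExpr version, so the sketch makes no claim about the output.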

howisitpossible

  • Newbie
  • Posts: 5
Re: Help with regex
« Reply #3 on: January 13, 2021, 09:14:07 pm »
I can't understand what is wrong

howisitpossible

  • Newbie
  • Posts: 5
Re: Help with regex
« Reply #4 on: January 13, 2021, 09:16:19 pm »
Quote
Hi, I'm not an expert in regular expressions, but if I hover the mouse over line 23, Lazarus gives a warning about an error in the expression.
See the attached image.

Hmm, I have no warnings.

BlueIcaro

  • Hero Member
  • *****
  • Posts: 791
    • Blog personal
Re: Help with regex
« Reply #5 on: January 13, 2021, 09:17:21 pm »
Hi, see the attached picture. The debugger raises an exception: unrecognized modifier(pos2)

I'm testing on Ubuntu 20.04 x64 with Lazarus 2.10, using this code:
Code:
program project1;

{$mode objfpc}{$H+}

uses {$IFDEF UNIX} {$IFDEF UseCThreads}
  cthreads, {$ENDIF} {$ENDIF}
  Classes { you can add units after this },
  RegExpr;

var
  txt, buf, filename: string;
  fileHtml: Textfile;
  reg: TRegExpr;
begin
  //write('Name of html-file: ');
  //read(filename);

  //AssignFile(fileHtml, filename);
  //Reset(fileHtml);

  //txt:='';
  //while not Eof(fileHtml) do
  //begin
  //  Readln(fileHtml, buf);
  //  txt:=txt+buf;
  //end;

  txt := '<html>  <head>     <title>Esta es mi primera pagina</title> </head> <body>    <h1>Esto es un encabezado</h1>    <p>Y esto es un parrafo, donde podemos escribir todo el rollo que se nos ocurra.</body></html>';

  reg := TRegExpr.Create('(<img.+?>)(?![\s\S]*\1)');



  if (reg.Exec(txt)) then
    repeat
      writeln(reg.Match[0]);
    until not reg.ExecNext;

end. 

/BlueIcaro
« Last Edit: January 13, 2021, 09:19:06 pm by BlueIcaro »

AlexTP

  • Hero Member
  • *****
  • Posts: 2365
    • UVviewsoft
Re: Help with regex
« Reply #6 on: January 13, 2021, 09:17:30 pm »
The regex uses an ASSERTION (the negative lookahead (?!...)), which is supported only in the latest TRegExpr from https://github.com/andgineer/TRegExpr
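
The (?![\s\S]*\1) part of the original regex is there to skip repeated tags. With a TRegExpr version that lacks lookahead, the same goal can probably be reached in Pascal code instead. This is a minimal sketch, not a tested solution for the Wikipedia page: the sample HTML and the names SampleHtml and seen are made up for illustration.

```pascal
program uniqueimgs;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

const
  // Hypothetical sample input; in the thread the text comes from a file.
  SampleHtml = '<img src="a.png"><p>x</p><img src="b.png"><img src="a.png">';

var
  reg: TRegExpr;
  seen: TStringList;
  i: Integer;
begin
  seen := TStringList.Create;
  reg := TRegExpr.Create('<img[^>]*>');
  try
    // Compare tags exactly; TStringList ignores case by default.
    seen.CaseSensitive := True;

    if reg.Exec(SampleHtml) then
      repeat
        // IndexOf returns -1 when this tag has not been collected yet.
        if seen.IndexOf(reg.Match[0]) < 0 then
          seen.Add(reg.Match[0]);
      until not reg.ExecNext;

    // Print each distinct tag once, in order of first appearance.
    for i := 0 to seen.Count - 1 do
      WriteLn(seen[i]);
  finally
    reg.Free;
    seen.Free;
  end;
end.
```

One difference from the lookahead version: the regex keeps the last occurrence of each tag, while this sketch keeps the first. For large pages a sorted list with Duplicates := dupIgnore would likely be faster than IndexOf, at the cost of losing document order.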

howisitpossible

  • Newbie
  • Posts: 5
Re: Help with regex
« Reply #7 on: January 13, 2021, 09:19:02 pm »
Can you help me to correct this regex?

BlueIcaro

  • Hero Member
  • *****
  • Posts: 791
    • Blog personal
Re: Help with regex
« Reply #8 on: January 13, 2021, 09:21:46 pm »

paweld

  • Hero Member
  • *****
  • Posts: 966
Re: Help with regex
« Reply #9 on: January 13, 2021, 09:28:52 pm »
Code: Pascal
program project1;

uses
  Classes, RegExpr;

var
  filename: string;
  sl: TStringList;
  reg: TRegExpr;

begin
  write('Name of html-file: ');
  readln(filename);
  sl := TStringList.Create;
  sl.LoadFromFile(filename);  // read the whole file at once
  // non-greedy match of each <img ...> tag
  reg := TRegExpr.Create('<img.*?>');
  if (reg.Exec(sl.Text)) then
  repeat
    writeln(reg.Match[0]);
  until not reg.ExecNext;
  reg.Free;
  sl.Free;
end.
Best regards / Pozdrawiam
paweld

howisitpossible

  • Newbie
  • Posts: 5
Re: Help with regex
« Reply #10 on: January 13, 2021, 09:35:05 pm »
Does this code find the tags without repetitions?

sstvmaster

  • Sr. Member
  • ****
  • Posts: 299
Re: Help with regex
« Reply #11 on: January 13, 2021, 10:37:28 pm »
For <img ... /> tags you can use this RegEx: <img.*?src="(.*?)"[^\>]+>

This RegEx needs too many steps: (<img.+?>)(?![\s\S]*\1)

reg.Match[0] returns the whole <img ... /> tag.
reg.Match[1] returns only the image url.

Code: Pascal
program project1;

uses
  Classes, RegExpr, SysUtils;

var
  i: Integer;
  sl: TStringList;
  reg: TRegExpr;
  sum: String = 'Found %d matches in ~%dms.';
  cstart, cstop: Int64;
begin
  cstart := GetTickCount64;  // start of timing (includes loading the file)
  i := 0;
  sl := TStringList.Create;
  sl.LoadFromFile('wikipedia.txt');
  reg := TRegExpr.Create;

  reg.Expression := '<img.*?src="(.*?)"[^\>]+>';

  if reg.Exec(sl.Text) then
  repeat
    WriteLn(reg.Match[0]); // with <img> tags
    //WriteLn(reg.Match[1]); // image url only
    Inc(i);
  until not reg.ExecNext();
  cstop := GetTickCount64-cstart;  // elapsed milliseconds

  WriteLn;
  WriteLn(Format(sum, [i, cstop]));

  reg.Free;
  sl.Free;

  ReadLn;
end.
« Last Edit: January 14, 2021, 12:16:10 am by sstvmaster »
greetings Maik

Windows 10,
- Lazarus 2.2.6 (stable) + fpc 3.2.2 (stable)
- Lazarus 2.2.7 (fixes) + fpc 3.3.1 (main/trunk)

sstvmaster

  • Sr. Member
  • ****
  • Posts: 299
Re: Help with regex
« Reply #12 on: January 13, 2021, 11:30:05 pm »
@AlexTP

This RegEx does not work: <img[^>]+src=(?:"|'')\K(.[^">]+?)(?="|'')

I am using your latest RegExpr version!
« Last Edit: January 14, 2021, 12:16:53 am by sstvmaster »

AlexTP

  • Hero Member
  • *****
  • Posts: 2365
    • UVviewsoft
Re: Help with regex
« Reply #13 on: January 14, 2021, 12:58:02 am »
Just read the docs on Sorokin's site: \K is NOT supported. The regex error tells you that.
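
Since \K is not available, the usual workaround is to put the wanted part in a capture group and read that group instead of the whole match. This is a minimal sketch under that assumption: the pattern is a simplified variant, not the one posted above, and the sample tag is made up for illustration.

```pascal
program imgsrc;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical sample tag, not taken from the Wikipedia page.
  SampleHtml = '<p><img alt="logo" src="/static/logo.png" width="88"></p>';

var
  reg: TRegExpr;
begin
  // '' inside a Pascal string literal is a single apostrophe, so the
  // regex seen by TRegExpr is: <img[^>]+src=["']([^"']+)["']
  reg := TRegExpr.Create('<img[^>]+src=["'']([^"'']+)["'']');
  try
    if reg.Exec(SampleHtml) then
      repeat
        // Group 1 holds only the text between the quotes after src=.
        WriteLn(reg.Match[1]);
      until not reg.ExecNext;
  finally
    reg.Free;
  end;
end.
```

Reading Match[1] plays the role that \K and the trailing lookahead had in the unsupported pattern: the surrounding text is still matched, it is just not what gets printed.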

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2020
  • Former Delphi 1-7, 10.2 user
Re: Help with regex
« Reply #14 on: January 14, 2021, 01:52:13 am »
Code: Pascal
reg:=TRegExpr.Create('(<img[^>]+)(src=["\''])([^"\'']+)');
if (reg.Exec(txt)) then
  repeat
    writeln('0: ',reg.Match[0]);
    writeln('1: ',reg.Match[1]);
    writeln('2: ',reg.Match[2]);
    writeln('3: ',reg.Match[3]);
  until not reg.ExecNext;

produces:

Quote
0: <img src="/static/images/footer/poweredby_mediawiki_88x31.png
1: <img
2: src="
3: /static/images/footer/poweredby_mediawiki_88x31.png

To explain what is happening (in my understanding):

* the parentheses group parts of the expression so that each part can be read back separately
* match[0] is the whole match: everything from <img up to the end of the image url (the closing quote and the > are not part of the pattern, so they are not included)
* match[1] is the first group, (<img[^>]+): <img followed by one or more characters that are not > (the ^ in [^>] means do not match >, and the + means match one or more characters). It gives characters back until src= can match, so here it stops just before src=
* match[2] is the second group, which matches src=" or src='
* match[3] is the third group, which matches everything after src=" or src=' up to, but not including, the next " or ' (the image url in this case, hence outputting reg.match[3] in the writeln is what you want).
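
The group numbering described above can be checked with a small loop over all groups. This is a minimal sketch, assuming the SubExprMatchCount property of the FPC RegExpr unit; the sample tag is made up for illustration.

```pascal
program showgroups;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical sample tag, not taken from the Wikipedia page.
  SampleHtml = '<img alt="logo" src="/static/logo.png" width="88">';

var
  reg: TRegExpr;
  i: Integer;
begin
  // Same expression as in the snippet above.
  reg := TRegExpr.Create('(<img[^>]+)(src=["\''])([^"\'']+)');
  try
    if reg.Exec(SampleHtml) then
      repeat
        // Index 0 is the whole match; 1..SubExprMatchCount are the groups.
        for i := 0 to reg.SubExprMatchCount do
          WriteLn(i, ': ', reg.Match[i]);
      until not reg.ExecNext;
  finally
    reg.Free;
  end;
end.
```

Printing every index side by side makes it easier to see which group carries the url before hard-coding Match[3] in the real program.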
« Last Edit: January 14, 2021, 01:54:00 am by trev »

 
