
Topic: How to form Regex expression (Read 1035 times)

ronhud

« on: December 30, 2022, 08:57:22 pm »
I am just starting to look at using TRegExpr. I want to extract a string of characters from a web page. I have written an expression to find 'Change' and capture all the characters until it finds </li.

RE.Expression := ('Change(.*?)</li');

Is this correct? I am using the RegExpr.pas unit. How do I access the result?

Roland57

« Reply #1 on: December 30, 2022, 09:09:02 pm »
Hello!

I suggest taking a look at my RegExpr examples.

AlexTP

« Reply #2 on: December 30, 2022, 10:23:48 pm »
It's also good to test your regex on this site: https://regex101.com/

Martin_fr

« Reply #3 on: December 30, 2022, 11:02:04 pm »
Quote
RE.Expression := ('Change(.*?)</li');
Is this correct?

It depends, but yes, it is.

However, it will also match:
Code: Text
Change item in list <ul><li>item 1 </li>

Do you want that to be found?

You might also want to ensure that "</li" in the regex does not match "</literal>", e.g. by adding a word boundary: "</li\b"
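
A minimal sketch of the word-boundary point, assuming the RegExpr unit that ships with Free Pascal. The helper FirstCapture and the sample text are made up for illustration; only TRegExpr.Create, Exec, Match and Free come from the library.

```pascal
program WordBoundaryDemo;
{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical input: a </literal> tag appears before the real </li>
  SAMPLE_TEXT = 'Change <literal>x</literal> done</li>';

// Returns the first capture group of APattern in AText, or '' if no match.
function FirstCapture(const APattern, AText: string): string;
var
  RE: TRegExpr;
begin
  Result := '';
  RE := TRegExpr.Create(APattern);
  try
    if RE.Exec(AText) then
      Result := RE.Match[1];
  finally
    RE.Free;
  end;
end;

begin
  // Without \b the lazy group stops at "</li" inside "</literal>"
  WriteLn('[', FirstCapture('Change(.*?)</li', SAMPLE_TEXT), ']');
  // With \b the "li" must be followed by a non-word character, so only "</li>" qualifies
  WriteLn('[', FirstCapture('Change(.*?)</li\b', SAMPLE_TEXT), ']');
end.
```

The first line should show a capture that ends just before "</literal>", and the second one a capture that runs up to the real "</li>".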

PierceNg

« Reply #4 on: December 31, 2022, 03:40:05 am »
Quote
I am just starting to look at using TRegExpr. I want to extract a string of characters from a web page. I have written an expression to find 'Change' and capture all the characters until it finds </li.
RE.Expression := ('Change(.*?)</li');

If you want the content but not the tags (aka meta content), first parse the HTML, then apply the regex to the content parts to pick out what you want.

The expedient way is to build your program with pas2js and run it directly on the web page, so you can use the familiar GetElementById and friends.

Otherwise, for HTML parsing, I found this: https://github.com/isemenkov/libpasmyhtml. It's archived, but it's a wrapper for a C library, so it should continue to work.

And for fun reading, http://regex.info/blog/2006-09-15/247, which discusses this famous saying:

Quote
Some people, when confronted with a problem, think
“I know, I'll use regular expressions.”   Now they have two problems.

Jurassic Pork

« Reply #5 on: December 31, 2022, 11:16:15 am »
Hello,
If your HTML file is an XHTML file, you can also try the XPath unit of fcl-xml.
Example:
Code: Pascal
program xPathTestChange;
// J.P December 2022
{$mode objfpc}{$H+}

uses Classes, DOM, DOM_HTML, SAX_HTML, XPath, StrUtils;
var
  htmlDoc: THTMLDocument;
  XPathRes: TXPathVariable;
  XPathExp: DOMString;
  TheNodeSet: TNodeSet;
  Res: UnicodeString;

begin
  htmlDoc := nil; // so that Free is safe if reading the file fails
  try
    // read input html file
    ReadHTMLFile(htmlDoc, 'd:\temp\ChangeTest.html');
    // Search for li which contains "Change" string
    XPathExp := '//li[contains(text(),"Change")]';
    XPathRes := EvaluateXPathExpression(XPathExp, htmlDoc.DocumentElement);
    try
      TheNodeSet := XPathRes.AsNodeSet;
      // Only read the first node if the query found one
      if TheNodeSet.Count > 0 then
      begin
        Res := TDOMNode(TheNodeSet[0]).TextContent;
        // Skip the 7 characters of "Change " and print the rest
        WriteLn(MidStr(Res, 8, Length(Res)));
      end;
    finally
      XPathRes.Free;
    end;
  finally
    htmlDoc.Free;
  end;
  ReadLn;
end.

For this HTML source:
Code: Text
<!DOCTYPE html>
<html>
<body>

<h1>The ol and ul elements</h1>

<p>The ol element defines an ordered list:</p>
<ol>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ol>

<p>The ul element defines an unordered list:</p>
<ul>
  <li>Change Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

</body>
</html>

Result:
Quote
Coffee

Friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

ronhud

« Reply #6 on: December 31, 2022, 11:44:42 am »
Thank you for all the replies. I am beginning to get a feel for regular expressions, and I also have the feeling that I need to look into XHTML.

Roland57

« Reply #7 on: January 01, 2023, 12:30:23 am »
Quote
How do I access the result?

In case you still need it, here is an example. The sample is the HTML document provided by Jurassic Pork.

Code: Pascal
{$MODE OBJFPC}{$H+}

uses
  RegExpr;

const
  SAMPLE =
    '<!DOCTYPE html>'#10 +
    '<html>'#10 +
    '<body>'#10 +
    ''#10 +
    '<h1>The ol and ul elements</h1>'#10 +
    ''#10 +
    '<p>The ol element defines an ordered list:</p>'#10 +
    '<ol>'#10 +
    '  <li>Coffee</li>'#10 +
    '  <li>Tea</li>'#10 +
    '  <li>Milk</li>'#10 +
    '</ol>'#10 +
    ''#10 +
    '<p>The ul element defines an unordered list:</p>'#10 +
    '<ul>'#10 +
    '  <li>Change Coffee</li>'#10 +
    '  <li>Tea</li>'#10 +
    '  <li>Milk</li>'#10 +
    '</ul>'#10 +
    ''#10 +
    '</body>'#10 +
    '</html>'#10;

var
  LExpr: TRegExpr;

begin
  WriteLn('==== Demo 1');

  LExpr := TRegExpr.Create('<li>Change (\w+)</li>');
  try
    if LExpr.Exec(SAMPLE) then
    begin
      WriteLn(LExpr.Match[0]); // The whole match
      WriteLn(LExpr.Match[1]); // Only the captured group
    end;
  finally
    LExpr.Free;
  end;

  { To find every string matching the expression, call ExecNext in a loop. }

  WriteLn('==== Demo 2');

  LExpr := TRegExpr.Create('<li>([\w\s]+)</li>');
  try
    if LExpr.Exec(SAMPLE) then
      repeat
        WriteLn(LExpr.Match[0]);
        WriteLn(LExpr.Match[1]);
      until not LExpr.ExecNext;
  finally
    LExpr.Free;
  end;
end.

Regards.

Roland

P.-S. Code edited for better readability.
« Last Edit: January 01, 2023, 12:44:15 pm by Roland57 »

MarkMLl

« Reply #8 on: January 01, 2023, 12:12:35 pm »
In any event I think it's worth reminding OP of the much-cited https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags which is a useful read even if one disagrees with the implied definition of "regular language", hence "regular expression".

The issue is that given input that looks like this

Code:
function foo;

begin
end foo;

(I know that's not HTML etc., I'm just using it to illustrate the point) a strict regex can't check that the final identifier matches the initial one. That's why Perl etc. introduced back-references.
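
As a rough sketch of what a back-reference adds, assuming the RegExpr unit that ships with Free Pascal: the pattern below uses \1 to require that the identifier after "end" repeats the one captured after "function". The function name NamesAgree and the test strings are invented for this illustration.

```pascal
program BackRefDemo;
{$mode objfpc}{$H+}

uses
  RegExpr;

// True if AText contains "function NAME; ... end NAME;" with the same NAME twice.
function NamesAgree(const AText: string): Boolean;
var
  RE: TRegExpr;
begin
  // \1 must repeat exactly what group 1 (\w+) captured
  RE := TRegExpr.Create('function\s+(\w+);.*?end\s+\1;');
  try
    RE.ModifierS := True; // let "." cross line breaks
    Result := RE.Exec(AText);
  finally
    RE.Free;
  end;
end;

begin
  // Closing identifier matches the opening one
  WriteLn(NamesAgree('function foo;'#10#10'begin'#10'end foo;'));
  // Closing identifier differs, so the back-reference cannot match
  WriteLn(NamesAgree('function foo;'#10#10'begin'#10'end bar;'));
end.
```

The first call should report a match and the second should not, which is exactly the check a strictly regular expression cannot express.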

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Thaddy

« Reply #9 on: January 01, 2023, 03:13:35 pm »

Quote
(I know that's not HTML etc., I'm just using it to illustrate the point) a strict regex can't check that the final identifier matches the initial one. That's why Perl etc. introduced back-references.
MarkMLl

Well, "Perl etc." includes TRegExpr, so "etc." includes Free Pascal. And it does back-propagation, which I believe is the correct term.
In memory of Gordon Moore  (January 3, 1929 – March 24, 2023) Just double the heaven every two years from now.

MarkMLl

« Reply #10 on: January 01, 2023, 03:26:33 pm »
Quote
Well, "Perl etc." includes TRegExpr, so "etc." includes Free Pascal. And it does back-propagation, which I believe is the correct term.

Yes, but CS purists will say that those implementations are broken since they don't comply with the description of "regular".

I'm not criticising Perl etc. What I'm trying to do is say where the "you can't parse HTML with regexes" doctrine comes from.

MarkMLl

Thaddy

« Reply #11 on: January 01, 2023, 03:52:36 pm »
In that case you are absolutely right.

matjaz

« Reply #12 on: January 04, 2023, 01:49:58 pm »
You can try this:

RE.Expression := ('Change(.*)<\/li');

/ is a metacharacter, so you must escape it with \ to match it.
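
One detail worth checking in the expression above: it uses .* where the original question used .*?, so the quantifier becomes greedy. The sketch below, assuming the RegExpr unit that ships with Free Pascal, compares the two forms on a made-up line with two list items; ShowCapture and TEXT_LINE are names invented for the illustration.

```pascal
program GreedyLazyDemo;
{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // Hypothetical input with two closing </li> tags on one line
  TEXT_LINE = '<li>Change Coffee</li><li>Tea</li>';

// Prints the pattern and what its first group captured in TEXT_LINE.
procedure ShowCapture(const APattern: string);
var
  RE: TRegExpr;
begin
  RE := TRegExpr.Create(APattern);
  try
    if RE.Exec(TEXT_LINE) then
      WriteLn(APattern, ' -> [', RE.Match[1], ']')
    else
      WriteLn(APattern, ' -> no match');
  finally
    RE.Free;
  end;
end;

begin
  ShowCapture('Change(.*?)</li'); // lazy: stops at the first </li
  ShowCapture('Change(.*)</li');  // greedy: runs on to the last </li
end.
```

With the lazy form the capture should be " Coffee"; with the greedy form it should extend through the first closing tag to " Coffee</li><li>Tea".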
« Last Edit: January 04, 2023, 01:59:29 pm by matjaz »

Warfley

« Reply #13 on: January 04, 2023, 02:17:12 pm »
Quote
Yes, but CS purists will say that those implementations are broken since they don't comply with the description of "regular".

It's actually a more practical point than just being a purist. If the expressions describe a non-regular language, you can't use a DFA for matching. DFA matching is theoretically optimal: it runs in O(n), meaning every character of the input only has to be visited once, which is the fastest text-matching algorithm you can have (on the flip side, memory consumption scales badly with the size of the regex because of the powerset construction). Non-regularity and the resulting need for backtracking (not back-propagation, that's an ML term) increase the complexity to O(n^2) or worse: in the worst case the matcher goes through every character and, from each one, has to walk over the rest of the string only to find out at the end that there is no match, and then continue with the next character.

This is still good enough for inputs that are not too large, and once you get computer-sciency enough you start calling anything "efficient" that doesn't have exponential runtime, but when matching large text files it can make a huge difference.
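
To make the "every character is visited once" idea concrete, here is a hand-built automaton for a single fixed literal, "</li". It is only a sketch, not what TRegExpr does internally and not a general regex-to-DFA compiler; the function name FindCloseLi is invented. The simple fallback rule is valid only because the four pattern characters are all different.

```pascal
program DfaScanDemo;
{$mode objfpc}{$H+}

// Returns the 1-based index where the first '</li' starts, or 0 if absent.
// The state is the number of pattern characters matched so far.
function FindCloseLi(const AText: string): Integer;
const
  PATTERN = '</li';
var
  State, I: Integer;
begin
  Result := 0;
  State := 0;
  // Single left-to-right pass: no character is ever looked at twice
  for I := 1 to Length(AText) do
  begin
    if AText[I] = PATTERN[State + 1] then
      Inc(State)
    else if AText[I] = PATTERN[1] then
      State := 1   // a '<' can always start a new attempt
    else
      State := 0;
    if State = Length(PATTERN) then
      Exit(I - Length(PATTERN) + 1);
  end;
end;

begin
  WriteLn(FindCloseLi('<li>Change Coffee</li>')); // prints 18
  WriteLn(FindCloseLi('no list here'));           // prints 0
end.
```

A backtracking matcher, by contrast, may restart from each position and rescan the remainder of the input, which is where the extra cost described above comes from.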

Back to topic: when I have to build regular expressions, I usually use online tools like https://regex101.com/, as they give direct feedback and show what matches and how it matches.
« Last Edit: January 04, 2023, 02:20:11 pm by Warfley »

MarkMLl

« Reply #14 on: January 04, 2023, 02:51:17 pm »
Quote
/ is meta character and you must use escape char \ to capture it

That's certainly true in the case of Perl etc. where / is the default separator for the m and s commands, but I'm not sure it's true in the case of regex libraries in general.

Quote
It's actually a more practical point than just being a purist.

It's certainly a practical detail that's very important to the implementor, but not necessarily to the user.

The really important thing is that regexes extended with back-references (the Perl term, and I believe that Perl was largely responsible for popularising them) /can/ be useful for processing e.g. HTML, particularly if the generator is known to have predictable characteristics.

But doing so is not necessarily good practice, from the POV of both performance and determinacy: far better to use a proper parser.

Which I suppose takes me back to one of my bêtes noires: programming languages and data description notations which have been promulgated despite being known to be difficult to parse. That leaves me arguing that programming syntax /should/ be regular, in the strictest "computer-sciency" sense of the word.

Actually, there's a corollary to that fairly near to home. Wirth designed Modula-2 such that a function definition had to be closed with the (case-sensitive) function name, making the syntax non-regular:

Code:
function foo;

begin
end foo;

ALGOL-60, on the other hand, defined that everything after "end" up to the end of the source record was discarded:

Code:
function foo;

begin
end Any old crap here

Which suggests that even in the late '50s there were people who recognised the undesirability of non-regular syntaxes.

MarkMLl

 
