Author Topic: Regex  (Read 937 times)

BubikolRamios

Regex
« on: June 18, 2019, 03:05:31 pm »
I can't see how the regex '>.*?<' can produce a lone '<' or '>', or even ''.

somestring is UTF-8 encoded and looks like '<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title>Suessgraeser</title>'
plus some linefeeds here and there.


Code: Pascal
  // The constructor already sets the expression, so no separate
  // re.Expression assignment is needed.
  re := TRegExpr.Create('>.*?<');
  try
    re.ModifierI := true;
    re.InputString := somestring;
    // Print the first match, then every following one.
    if re.Exec(re.InputString) then
    begin
      writeln(re.Match[0]);
      while re.ExecNext do
        writeln(re.Match[0]);
    end;
  finally
    re.Free;
  end;

output:
Quote
>
<
>
<
>
<
>Suessgraeser<
>
<
>

<
>

<
>

<
>

<
>
<
>
<
>
<
><
...
lazarus-2.0.2-fpc-3.0.4-win32

Thaddy

Re: Regex
« Reply #1 on: June 18, 2019, 03:08:43 pm »
Try the unit uregexpr instead of regexpr. uregexpr works on UTF-16.
Assign the text to search to a UnicodeString before using it,
and convert back to UTF-8 afterwards if necessary.

Whether uregexpr is present depends on the FPC version. It is certainly in 3.2.0 and higher; I'm not sure about 3.0.4.
The functionality is exactly the same.
« Last Edit: June 18, 2019, 03:17:15 pm by Thaddy »
Most people that want to use threading should learn to patch their jeans first: use a needle.

marcov

Re: Regex
« Reply #2 on: June 18, 2019, 03:26:19 pm »
.* matches any character, so it also matches complete tags. I'm no regex expert, but you probably need to exclude < and > from the set of characters allowed between > and <.
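A minimal sketch of that suggestion, assuming the regexpr unit that ships with FPC. CollectMatches is a hypothetical helper name made up for this example, and the input is a shortened version of the string from the first post.

```pascal
program TagTextDemo;
{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

// Collects every match of Pattern in Input, one match per list entry.
// CollectMatches is a hypothetical helper name, not part of RegExpr.
function CollectMatches(const Pattern, Input: string): TStringList;
var
  re: TRegExpr;
begin
  Result := TStringList.Create;
  re := TRegExpr.Create;
  try
    re.Expression := Pattern;
    if re.Exec(Input) then
      repeat
        Result.Add(re.Match[0]);
      until not re.ExecNext;
  finally
    re.Free;
  end;
end;

var
  Matches: TStringList;
  i: Integer;
begin
  // [^<>]+ needs at least one character that is neither < nor >,
  // so a bare '><' between two adjacent tags is no longer a match.
  Matches := CollectMatches('>[^<>]+<', '<html><head><title>Suessgraeser</title>');
  try
    for i := 0 to Matches.Count - 1 do
      writeln(Matches[i]);
  finally
    Matches.Free;
  end;
end.
```

For this input the only match should be >Suessgraeser<. Note that [^<>] still accepts linefeeds, so text between two tags on different lines will continue to match.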

Thaddy

Re: Regex
« Reply #3 on: June 18, 2019, 03:31:21 pm »
That's correct. This expression returns >something< as I demo'd in the other question by OP.
An easy way to solve it - since (u)regexpr has no look-ahead - is by a sub-expression, but there are other ways as described on the official website.
« Last Edit: June 18, 2019, 03:34:33 pm by Thaddy »

BubikolRamios

Re: Regex
« Reply #4 on: June 18, 2019, 08:03:27 pm »
With both regex examples, writeln still writes single < and > as in the OP.

Quote
>.*?<
https://regex101.com/r/nBy7uK/1
Quote
>[^<>]+<
https://regex101.com/r/XH54hH/1

Both look good to me there. The second one is, for some reason, even more exact.

I don't see any single highlighted < or > there.

BTW: it appears there is no uregexpr here. I modified the regular regexpr according to the last post here:
https://forum.lazarus.freepascal.org/index.php/topic,39578.msg272131.html#msg272131
and it works.


« Last Edit: June 18, 2019, 08:30:03 pm by BubikolRamios »

BubikolRamios

Re: Regex
« Reply #5 on: June 19, 2019, 02:05:59 am »
Ahh, the problem is that in the console this
Quote
<
>
looks like it was written twice, when in fact it was written once: < + linefeed + >
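One way to spot this kind of thing is to make the line breaks visible before printing each match. The following is only a sketch: Visible is a hypothetical helper name, the input string is invented sample data, and the brackets are there just to show where each match starts and ends.

```pascal
program ShowLinefeeds;
{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

// Replaces CR and LF with the visible markers \r and \n so that a match
// spanning several lines is printed on one console line.
// Visible is a hypothetical helper name used only in this sketch.
function Visible(const S: string): string;
begin
  Result := StringReplace(S, #13, '\r', [rfReplaceAll]);
  Result := StringReplace(Result, #10, '\n', [rfReplaceAll]);
end;

var
  re: TRegExpr;
begin
  re := TRegExpr.Create;
  try
    re.Expression := '>.*?<';
    // Sample input: two tags separated by a linefeed.
    if re.Exec('<head>' + #10 + '<title>Suessgraeser</title>') then
      repeat
        writeln('[', Visible(re.Match[0]), ']');
      until not re.ExecNext;
  finally
    re.Free;
  end;
end.
```

With the behaviour reported in this thread, where . also matches a linefeed, a match that spans a line break then shows up as a single bracketed line instead of two console lines.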

Thaddy

Re: Regex
« Reply #6 on: June 19, 2019, 08:28:10 am »
uregexpr.pp was added ten months ago, so it needs 3.2.0 or trunk. It is indeed not in 3.0.4. You can simply download the unit from trunk. It does not contain code that relies on features newer than 3.0.4.
See this link: https://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/regexpr/src/uregexpr.pp?view=log or use svn directly.
Put it in /packages/regexpr/src
« Last Edit: June 19, 2019, 08:36:38 am by Thaddy »

BubikolRamios

Re: Regex
« Reply #7 on: June 21, 2019, 02:03:57 pm »
I guess there is a bug in tregexp
https://regex101.com/r/XH54hH/1

it finds
Quote
>= 3 )) ||
  ((navigator.appName == "Microsoft Internet Explorer") &&
  (parseInt(navigator.appVersion) >= 4 )));
function MSFPpreload(img)
{
  var a=new Image(); a.src=img; return a;
}
// --><

as a match.

« Last Edit: June 21, 2019, 02:05:41 pm by BubikolRamios »

rvk

Re: Regex
« Reply #8 on: June 21, 2019, 03:30:47 pm »
Quote
I guess there is a bug in tregexp
https://regex101.com/r/XH54hH/1
it finds
<snip>
as a match.

Are you sure? I don't get that match with that regex.

totya

Re: Regex
« Reply #9 on: July 16, 2019, 11:42:22 am »
Quote
uregexpr.pp is added ten months ago. So needs 3.2.0 or trunk. ...

Hi!

Using uregexpr is slightly complicated, because you need to convert between Unicode and UTF-8 strings.

As far as I can see, uregexpr is no longer necessary, because the latest trunk version of regexpr handles Free Pascal (UTF-8) strings very well.

Have a nice day.

justnewbie

Re: Regex
« Reply #10 on: July 16, 2019, 01:15:18 pm »
Maybe someone can help me.
I have a text that I got by using a regular expression. This text contains some new-line (\n) tokens.
How can I remove these new-lines within the text (ie. I need the text in 1 line)?
Please note: I don't want to remove all new-lines from my whole text, only within the text-part that I got by the regex.
« Last Edit: July 16, 2019, 01:17:50 pm by justnewbie »

440bx

Re: Regex
« Reply #11 on: July 16, 2019, 01:33:21 pm »
Quote
I have a text that I got by using a regular expression. This text contains some new-line (\n) tokens.
How can I remove these new-lines within the text (ie. I need the text in 1 line)?
Please note: I don't want to remove all new-lines from my whole text, only within the text-part that I got by the regex.

Presuming that the text you got from the regular expression is (or was at one time) in a string, you could simply use StringReplace https://www.freepascal.org/docs-html/rtl/sysutils/stringreplace.html to replace the occurrences of \n with an empty string.
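A minimal sketch of the StringReplace approach, applied only to the matched text. RemoveLineBreaks is a hypothetical helper name and the Matched value is invented sample data standing in for whatever the regular expression returned.

```pascal
program StripLineBreaks;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Removes CR and LF characters from S. In Pascal source the newline is
// written as #10 (and #13#10 on Windows), not as the escape '\n'.
// RemoveLineBreaks is a hypothetical helper name for this sketch.
function RemoveLineBreaks(const S: string): string;
begin
  Result := StringReplace(S, #13, '', [rfReplaceAll]);
  Result := StringReplace(Result, #10, '', [rfReplaceAll]);
end;

var
  Matched: string;
begin
  // Stand-in for the text returned by the regular expression, e.g. re.Match[0].
  Matched := 'first part' + #13#10 + 'second part';
  writeln(RemoveLineBreaks(Matched));
end.
```

Because only the matched substring is passed in, the line breaks in the rest of the text stay untouched. Replacing with a space instead of an empty string may read better if the lines should not be glued together.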
using FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

justnewbie

Re: Regex
« Reply #12 on: July 16, 2019, 02:04:59 pm »
Quote
Presuming that the text you got from the regular expression is (or was at one time) in a string, you could simply use StringReplace https://www.freepascal.org/docs-html/rtl/sysutils/stringreplace.html to replace the occurrences of \n with an empty string.

Thank you, but I need to use ReplaceRegExpr.
I got the solution here: https://forum.lazarus.freepascal.org/index.php/topic,46098.msg327483.html#msg327483

SymbolicFrank
