Author Topic: Regex  (Read 3046 times)

BubikolRamios

  • Sr. Member
  • ****
  • Posts: 258
Regex
« on: June 18, 2019, 03:05:31 pm »
I can't see how the regex '>.*?<' can produce a lone '<' or '>', or even an empty match.

somestring is UTF-8 encoded and looks like '<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title>Suessgraeser</title>'
plus some linefeeds here and there.


Code: Pascal
  re := TRegExpr.Create('>.*?<');
  try
    re.ModifierI := true;  // case-insensitive matching
    // Exec finds the first match; ExecNext continues after the previous one
    if re.Exec(somestring) then
    begin
      writeln(re.Match[0]);
      while re.ExecNext do
        writeln(re.Match[0]);
    end;
  finally
    re.Free;
  end;

output:
Quote
>
<
>
<
>
<
>Suessgraeser<
>
<
>

<
>

<
>

<
>

<
>
<
>
<
>
<
><
...
lazarus 3.2-fpc-3.2.2-win32/win64

Thaddy

  • Hero Member
  • *****
  • Posts: 14197
  • Probably until I exterminate Putin.
Re: Regex
« Reply #1 on: June 18, 2019, 03:08:43 pm »
Try the unit uregexpr instead of regexpr; uregexpr works on UTF-16 strings.
Assign the text to search to a unicodestring before using it,
and convert back to UTF-8 afterwards if necessary.

Whether uregexpr is present depends on the FPC version. It is certainly in 3.2.0 and higher; I'm not sure about 3.0.4.
The functionality is exactly the same.
« Last Edit: June 18, 2019, 03:17:15 pm by Thaddy »
Specialize a type, not a var.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11382
  • FPC developer.
Re: Regex
« Reply #2 on: June 18, 2019, 03:26:19 pm »
.* matches any character, so it also matches complete tags. I'm no regex expert, but you probably need to exclude < and > from the set of characters allowed between > and <.
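
A minimal sketch of that suggestion, assuming the stock regexpr unit that ships with FPC. The helper name CollectTagText and the sample string are made up for illustration. The negated class [^<>] keeps a match from running across a tag boundary:

```pascal
program TagTextDemo;
{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

// Hypothetical helper: adds every '>text<' match found in Html to Matches.
procedure CollectTagText(const Html: string; Matches: TStrings);
var
  re: TRegExpr;
begin
  // [^<>]+ means "one or more characters that are neither < nor >"
  re := TRegExpr.Create('>[^<>]+<');
  try
    if re.Exec(Html) then
      repeat
        Matches.Add(re.Match[0]);
      until not re.ExecNext;
  finally
    re.Free;
  end;
end;

var
  Found: TStringList;
  i: Integer;
begin
  Found := TStringList.Create;
  try
    CollectTagText('<html><head><title>Suessgraeser</title></head>', Found);
    for i := 0 to Found.Count - 1 do
      WriteLn(Found[i]);  // prints >Suessgraeser<
  finally
    Found.Free;
  end;
end.
```

Note that a linefeed is also "neither < nor >", so whitespace between two tags still counts as a match with this pattern.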

Thaddy

  • Hero Member
  • *****
  • Posts: 14197
  • Probably until I exterminate Putin.
Re: Regex
« Reply #3 on: June 18, 2019, 03:31:21 pm »
That's correct. This expression returns >something< as I demo'd in the other question by OP.
An easy way to solve it - since (u)regexpr has no look-ahead - is by a sub-expression, but there are other ways as described on the official website.
« Last Edit: June 18, 2019, 03:34:33 pm by Thaddy »
Specialize a type, not a var.

BubikolRamios

  • Sr. Member
  • ****
  • Posts: 258
Re: Regex
« Reply #4 on: June 18, 2019, 08:03:27 pm »
With both regex examples, writeln still writes single < and > characters, as in the original post.

Quote
>.*?<
https://regex101.com/r/nBy7uK/1
Quote
>[^<>]+<
https://regex101.com/r/XH54hH/1

Both look good to me. The second, for some reason, is even more exact.


I don't see any single, colored < or > there.


BTW: it appears there is no uregexpr here. I modified the regular regexpr according to the last post here:
https://forum.lazarus.freepascal.org/index.php/topic,39578.msg272131.html#msg272131
and it works.


« Last Edit: June 18, 2019, 08:30:03 pm by BubikolRamios »
lazarus 3.2-fpc-3.2.2-win32/win64

BubikolRamios

  • Sr. Member
  • ****
  • Posts: 258
Re: Regex
« Reply #5 on: June 19, 2019, 02:05:59 am »
Ahh, the problem is that in the console this
Quote
<
>
looks like it was written twice; in fact it was written once: < + linefeed + >
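
One way to avoid that confusion is to make the line breaks visible before printing a match. A small sketch, assuming only SysUtils; the helper name Visible is hypothetical:

```pascal
program ShowLineBreaks;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: replaces CR and LF with the visible texts \r and \n,
// so a match that contains a line break still prints on a single line.
function Visible(const S: string): string;
begin
  Result := StringReplace(S, #13, '\r', [rfReplaceAll]);
  Result := StringReplace(Result, #10, '\n', [rfReplaceAll]);
end;

begin
  // A match consisting of '>', a Windows line break and '<'
  WriteLn('[', Visible('>' + #13#10 + '<'), ']');  // prints [>\r\n<]
end.
```

In the original loop this would be used as writeln(Visible(re.Match[0])).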
lazarus 3.2-fpc-3.2.2-win32/win64

Thaddy

  • Hero Member
  • *****
  • Posts: 14197
  • Probably until I exterminate Putin.
Re: Regex
« Reply #6 on: June 19, 2019, 08:28:10 am »
uregexpr.pp was added ten months ago, so it needs 3.2.0 or trunk. It is indeed not in 3.0.4. You can simply download the unit from trunk; it does not contain code that relies on features newer than 3.0.4.
See this link: https://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/packages/regexpr/src/uregexpr.pp?view=log or use svn directly.
Put it in /packages/regexpr/src
« Last Edit: June 19, 2019, 08:36:38 am by Thaddy »
Specialize a type, not a var.

BubikolRamios

  • Sr. Member
  • ****
  • Posts: 258
Re: Regex
« Reply #7 on: June 21, 2019, 02:03:57 pm »
I guess there is a bug in TRegExpr
https://regex101.com/r/XH54hH/1

it finds
Quote
>= 3 )) ||
  ((navigator.appName == "Microsoft Internet Explorer") &&
  (parseInt(navigator.appVersion) >= 4 )));
function MSFPpreload(img)
{
  var a=new Image(); a.src=img; return a;
}
// --><

as a match.

« Last Edit: June 21, 2019, 02:05:41 pm by BubikolRamios »
lazarus 3.2-fpc-3.2.2-win32/win64

rvk

  • Hero Member
  • *****
  • Posts: 6109
Re: Regex
« Reply #8 on: June 21, 2019, 03:30:47 pm »
Quote
I guess there is a bug in tregexp
https://regex101.com/r/XH54hH/1
it finds
<snip>
as a match.

Are you sure? Not for me with that regex.

totya

  • Hero Member
  • *****
  • Posts: 720
Re: Regex
« Reply #9 on: July 16, 2019, 11:42:22 am »
Quote
uregexpr.pp is added ten months ago. So needs 3.2.0 or trunk. ...

Hi!

Using uregexpr is slightly complicated, because you need to convert between Unicode and UTF-8 strings.

As far as I can see, uregexpr is no longer necessary, because the latest trunk version of regexpr handles Free Pascal (UTF-8) strings very well.

Have a nice day.

justnewbie

  • Sr. Member
  • ****
  • Posts: 292
Re: Regex
« Reply #10 on: July 16, 2019, 01:15:18 pm »
Maybe someone can help me.
I have a text that I got by using a regular expression. This text contains some new-line (\n) tokens.
How can I remove these new-lines within the text (ie. I need the text in 1 line)?
Please note: I don't want to remove all new-lines from my whole text, only within the text-part that I got by the regex.
« Last Edit: July 16, 2019, 01:17:50 pm by justnewbie »

440bx

  • Hero Member
  • *****
  • Posts: 3944
Re: Regex
« Reply #11 on: July 16, 2019, 01:33:21 pm »
Quote
I have a text that I got by using a regular expression. This text contains some new-line (\n) tokens.
How can I remove these new-lines within the text (ie. I need the text in 1 line)?
Please note: I don't want to remove all new-lines from my whole text, only within the text-part that I got by the regex.

Presuming that the text you got from the regular expression is (or was at one time) in a string, you could simply use StringReplace https://www.freepascal.org/docs-html/rtl/sysutils/stringreplace.html to replace the occurrences of \n with an empty string.
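
A minimal sketch of that approach, assuming the matched text is already in a string variable. Pascal string literals have no \n escape, so the line break characters are written as #13 (CR) and #10 (LF); the helper name RemoveLineBreaks is made up for illustration:

```pascal
program StripLineBreaks;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: removes every CR and LF from a matched fragment,
// which covers both Windows (CR+LF) and Unix (LF) line endings.
function RemoveLineBreaks(const S: string): string;
begin
  Result := StringReplace(S, #13, '', [rfReplaceAll]);
  Result := StringReplace(Result, #10, '', [rfReplaceAll]);
end;

begin
  // prints firstsecondthird
  WriteLn(RemoveLineBreaks('first' + #13#10 + 'second' + #10 + 'third'));
end.
```

Only the fragment passed in is changed, so line breaks elsewhere in the full text stay untouched.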
(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

justnewbie

  • Sr. Member
  • ****
  • Posts: 292
Re: Regex
« Reply #12 on: July 16, 2019, 02:04:59 pm »
Quote
Presuming that the text you got from the regular expression is (or was at one time) in a string, you could simply use StringReplace https://www.freepascal.org/docs-html/rtl/sysutils/stringreplace.html to replace the occurrences of \n with an empty string.

Thank you, but I need to use ReplaceRegExpr.
I found the solution here: https://forum.lazarus.freepascal.org/index.php/topic,46098.msg327483.html#msg327483
