Author Topic: My first regexpr  (Read 438 times)

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7641
My first regexpr
« on: August 10, 2019, 03:39:43 pm »
I have avoided regular expressions until now, but now I have to do more matching than usual, and I fear that quickly hacking up Pascal string code won't be faster.

I got somewhat on track, but then got stuck.

My first problem is matching strings like

  ; this is a title of article part 1
  ; this is a title of article part 10
  ; this is a title of article part, #10

where the "this is a title" part is variable. It must end in a part clause, though, and the variable part isn't allowed to contain certain characters, e.g. ;

I want to replace each match by ($0) minus the ;, but for now I remove the ; with a StringReplace afterwards.

What I have so far is:

Code: Pascal
  s2:=ReplaceRegExpr(';[^;()\?]+\spart[,]*\s[#]*[\d]+',s,'_($0)');

but it fails to match multiple numbers at the end (the "part 10" string), and the one with a comma after part.

Any improvements, simplifications, comments? The expression is probably more complicated due to me experimenting.

p.s. ReplaceRegExpr is from one of the Sorokin regexpr example programs; it just creates the class, assigns the first argument to .expression and does a .replace with the second and third arguments.

p.s.2 entering the expression in some online tools does seem to match the multi digit numbers (part 10). Maybe some syntax difference?
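
For readers following along, here is a minimal sketch of the kind of wrapper described above, using the TRegExpr class from FPC's RegExpr unit. The helper name ReplaceViaClass is hypothetical, and the pattern is a slightly simplified variant of the one in the post (,? and #? instead of [,]* and [#]*). Whether the original wrapper passes True for the substitution flag is not stated in the thread; it is assumed here, since $0 in the template is only expanded when substitution is enabled.

```pascal
program ReplaceSketch;
{$mode objfpc}{$H+}

uses
  RegExpr;

// Hypothetical helper, named for illustration only: creates the class,
// assigns the expression and runs a replace with the given template.
function ReplaceViaClass(const AExpr, AInput, ATemplate: string): string;
var
  re: TRegExpr;
begin
  re := TRegExpr.Create;
  try
    re.Expression := AExpr;
    // Assumption: True enables substitution, so $0 expands to the whole match.
    Result := re.Replace(AInput, ATemplate, True);
  finally
    re.Free;
  end;
end;

const
  // The three sample lines from the post.
  Samples: array[0..2] of string = (
    '; this is a title of article part 1',
    '; this is a title of article part 10',
    '; this is a title of article part, #10');
var
  i: Integer;
begin
  for i := Low(Samples) to High(Samples) do
    WriteLn(ReplaceViaClass(';[^;()?]+\spart,?\s#?\d+', Samples[i], '_($0)'));
end.
```

Running a small test program like this against each sample line separately is a quick way to tell a pattern problem from an input problem.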

Thaddy

  • Hero Member
  • *****
  • Posts: 9303
Re: My first regexpr
« Reply #1 on: August 10, 2019, 05:54:58 pm »
Marco, what's the terminating part? If I have both the start and the terminator this is quite easy, but I can not follow your reasoning.
also related to equus asinus.

marcov

Re: My first regexpr
« Reply #2 on: August 10, 2019, 06:31:09 pm »
There can be any data after the number. So the number is the terminating part.

I just don't know why it doesn't match two digits. \d+ or [\d]+ should match multiple digits, shouldn't it?

Btw  I should have said the first example string goes ok, while the second and third don't.

VTwin

  • Hero Member
  • *****
  • Posts: 799
  • Former Turbo Pascal 3 user
Re: My first regexpr
« Reply #3 on: August 10, 2019, 07:30:54 pm »
Maybe something like:

;.*(part)([^0-9]*)?([0-9]*)

or

;.*(part)([^\d]*)?([\d]*)

saves 2 characters.

Edit: I forgot to disallow ;

;[^;]*(part)([^\d]*)?([\d]*)
« Last Edit: August 10, 2019, 08:42:34 pm by VTwin »
“Talk is cheap. Show me the code.” -Linus Torvalds

macOS 10.13.6: Lazarus 2.0.7 fixes svn 62300 (64 bit Cocoa)
Ubuntu 18.04.3: Lazarus 2.0.6 (64 bit on VBox)
Windows 7 Pro SP1: Lazarus 2.0.6 (64 bit on VBox)
fpc 3.0.4

440bx

  • Hero Member
  • *****
  • Posts: 1294
Re: My first regexpr
« Reply #4 on: August 10, 2019, 08:31:34 pm »
Quote
quickly hacking up pascal string code won't be faster I fear.
Is it a requirement to write the program in Pascal or could the program to process the input file be written in a text processing language ?  I ask because, AWK would make what you described trivial to implement.



using FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

marcov

Re: My first regexpr
« Reply #5 on: August 10, 2019, 08:52:48 pm »
Quote
quickly hacking up pascal string code won't be faster I fear.
Is it a requirement to write the program in Pascal

Yes.

Quote
or could the program to process the input file be written in a text processing language ?  I ask because, AWK would make what you described trivial to implement.

It is for Windows, so no (though to be honest, I don't really want to on *nix either). A standalone pcre.dll is about the limit of what is doable, and only if absolutely necessary (read: I prefer to use regexpr).

marcov

Re: My first regexpr
« Reply #6 on: August 10, 2019, 09:12:10 pm »
Quote
;[^;]*(part)([^\d]*)?([\d]*)

Thanks. I think the last quantifier must be + rather than *, because otherwise it also accepts input with no number (which it shouldn't).

While making a shorter unit test, I found out that the regex (both yours and mine) is fine; the problem is Unicode-related. There was only one example with a two-digit number in it, and it had a Unicode apostrophe in it.

Anyway thanks. The expression is cleaner and more structured than mine. I have a basis to continue again.
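
The * versus + point above can be checked with a short sketch. It assumes the ExecRegExpr convenience function from FPC's RegExpr unit and uses a simplified pattern without capture groups (\D* in place of ([^\d]*)?), so it is an illustration rather than the exact expression from the thread.

```pascal
program QuantifierSketch;
{$mode objfpc}{$H+}

uses
  RegExpr;

const
  // A title that ends in "part" but has no number after it.
  NoNumber = '; this is a title of article part';
begin
  // \d* allows zero digits, so the title without a number is still accepted.
  WriteLn(ExecRegExpr(';[^;]*part\D*\d*', NoNumber));
  // \d+ requires at least one digit, so the same input is rejected.
  WriteLn(ExecRegExpr(';[^;]*part\D*\d+', NoNumber));
end.
```

The first call reports a match and the second does not, which is why the trailing quantifier needs to be + if a number is mandatory.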

440bx

Re: My first regexpr
« Reply #7 on: August 10, 2019, 09:12:38 pm »
Quote
It is for Windows, so no (though to be honest, I don't really want to on *nix either). A standalone pcre.dll is about the limit of what is doable, and only if absolutely necessary (read: prefer to use regexpr).
Just FYI, GNU AWK runs under Windows but, if the program has to be written in Pascal then it's not an option.

Also FYI, string processing in AWK is _fast_, really _fast_.  Its optimized algorithms often _beat_ simpler compiled code.


VTwin

Re: My first regexpr
« Reply #8 on: August 15, 2019, 03:06:44 am »

Excellent. Greedy vs non-greedy has thrown me off a few times. I've run into different defaults among implementations.
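
As a small illustration of the greedy versus non-greedy difference mentioned here, the sketch below compares .* with .*? using TRegExpr from FPC's RegExpr unit. The helper ShowMatch and the sample string are made up for this example; the defaults of other regex implementations are not covered.

```pascal
program GreedySketch;
{$mode objfpc}{$H+}

uses
  RegExpr;

// Hypothetical helper: prints the whole match (Match[0]) for a pattern.
procedure ShowMatch(const AExpr, ASample: string);
var
  re: TRegExpr;
begin
  re := TRegExpr.Create;
  try
    re.Expression := AExpr;
    if re.Exec(ASample) then
      WriteLn(AExpr, ' -> "', re.Match[0], '"')
    else
      WriteLn(AExpr, ' -> no match');
  finally
    re.Free;
  end;
end;

const
  // Made-up sample containing the word "part" twice.
  Sample = '; one part 1 and part 2';
begin
  // Greedy: .* runs as far as it can, so the match ends at the last "part".
  ShowMatch(';.*part', Sample);
  // Non-greedy: .*? stops as early as it can, so the match ends at the first "part".
  ShowMatch(';.*?part', Sample);
end.
```

With a single "part" per line the two forms match the same text, so the difference only shows up on inputs like the sample above.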