Lazarus

Title: Regex conversion escapes
Post by: MarkMLl on May 22, 2022, 05:44:14 pm
Considering the RegExpr library that comes with FPC 3.2.2:

Does anybody know whether it is possible to use a conversion escape to change the case of a sequence of characters to lowercase? Using this:

Code:
Pattern:

^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$

Replacement:

${1}0x\L$2\E$3

I find the first digit of $2 converted, after which E is output literally with the remainder unchanged.

This is for the final stages of some user-defined text transformations (decoding captured data from a logic analyser... if it can disassemble x86 it can do damn well anything), so I would like it to be possible to specify that e.g. hex output is in the appropriate form.

MarkMLl
Title: Re: Regex conversion escapes
Post by: Jurassic Pork on May 22, 2022, 05:53:20 pm
Hello,
do you have an example with data: input and expected output?
Friendly, J.P
Title: Re: Regex conversion escapes
Post by: MarkMLl on May 22, 2022, 06:35:25 pm
It appears to be mostly working but erratic:

Code: Pascal
program testRegex;

uses RegExpr;

const
  pattern= '^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$';
  substitution= '${1}0x\L$2\E$3';
  test1= 'Test1: 0x0123456789ABCDEF something';
  test2= 'Test2: 0xFEDCBA9876543210 something';

var
  regex: TRegExpr;

begin
  regex := TRegExpr.Create;
  try
    regex.Expression := pattern;
    // Substitute expands the template against the most recent successful Exec.
    if regex.Exec(test1) then
      WriteLn(regex.Substitute(substitution))
    else
      WriteLn('Failed');
    if regex.Exec(test2) then
      WriteLn(regex.Substitute(substitution))
    else
      WriteLn('Failed')
  finally
    // Release the object even if Exec or Substitute raises.
    regex.Free
  end
end.

What I'm getting as a result is

Code:
Test1: 0x0123456789abcdefE something
Test2: 0xfedcba9876543210E something

i.e. it's ignoring the \E. That's slightly different from what I (think I) saw earlier, where only a single character was being converted... I've just tinkered with a few more patterns and data and can't duplicate that.

MarkMLl
Title: Re: Regex conversion escapes
Post by: Jurassic Pork on May 22, 2022, 07:30:15 pm
\E does not seem to be necessary to stop the case change when substituting a group.
With this:
Code: Pascal
const
  pattern= '^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$';
  substitution= '${1}0x\L$2$3';
  test1= 'Test1: 0x0123456789ABCDEF SomeThing';
  test2= 'Test2: 0xFEDCBA9876543210 SomeThing';

I get this:
Quote
Test1: 0x0123456789abcdef SomeThing
Test2: 0xfedcba9876543210 SomeThing
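
To compare the two replacement templates from this thread side by side, a minimal sketch along the lines of the earlier test program might look like this. The program name, the Sample constant and the Templates array are made up for illustration; only TRegExpr.Exec and TRegExpr.Substitute come from the posts above, and what each line prints depends on the RegExpr version in use.

```pascal
program CompareTemplates;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  Pattern = '^(.+?)0x([0-9A-Fa-f]+)(.*)$';
  // Illustrative input, same shape as the test strings in the thread.
  Sample  = 'Test1: 0x0123456789ABCDEF SomeThing';
  // The two replacement templates discussed in this thread.
  Templates: array[0..1] of string = (
    '${1}0x\L$2\E$3',   // Perl-style, with \E
    '${1}0x\L$2$3'      // without \E
  );

var
  Regex: TRegExpr;
  i: Integer;

begin
  Regex := TRegExpr.Create;
  try
    Regex.Expression := Pattern;
    if Regex.Exec(Sample) then
      // Substitute can be called repeatedly against the same match.
      for i := Low(Templates) to High(Templates) do
        WriteLn(Templates[i], ' -> ', Regex.Substitute(Templates[i]))
    else
      WriteLn('No match');
  finally
    Regex.Free;
  end;
end.
```

Running it against the installed library shows directly whether \E is consumed or emitted literally.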
Title: Re: Regex conversion escapes
Post by: MarkMLl on May 22, 2022, 07:52:11 pm
I see what's happening. First, I thought I'd read through the page fairly carefully but now I see https://wiki.lazarus.freepascal.org/RegEx_packages#Change_case_on_replaces and there's no explicit \E as an "end of conversion" marker. It's also got the capability of messing up \n etc., which is unfortunate since I've gone to some trouble to allow the user to tune that to exactly what he wants.

Second, it's defaulting to "non-greedy" which I think is a comparatively recent change (and isn't what Perl etc. does).

Thanks for looking, I think everything makes sense now.

Very slightly later: it's possible to simulate the missing \E provided that the first non-match is non-alpha:

Code:
Without conversion:

-- --- CS FFFF0 B8  [Opcode 0xB8]
-- --- CS FFFF1 FF  [Immediate low 0x00FF]
-- --- CS FFFF2 FF  [Immediate 0xFFFF]
-- ---    FFFF0         MOV AX,FFFFH
-- --- CS FFFF3 8E  [Opcode 0x8E]
-- --- CS FFFF4 D8  [mod-reg-r/m 0xD8]
-- ---    FFFF3         MOV DS,AX

Using this Perl-style rule:

/^(.+?)0x([0-9A-Fa-f]+)(.*)$/${1}0x\L$2\u$3/g

-- --- CS FFFF0 B8  [Opcode 0xb8]
-- --- CS FFFF1 FF  [Immediate low 0x00ff]
-- --- CS FFFF2 FF  [Immediate 0xffff]
-- ---    FFFF0         MOV AX,FFFFH
-- --- CS FFFF3 8E  [Opcode 0x8e]
-- --- CS FFFF4 D8  [mod-reg-r/m 0xd8]
-- ---    FFFF3         MOV DS,AX

All of the output there is rule-generated, with the final detail of formatting handled by that regex.

I think that's good enough :-)

MarkMLl
Title: Re: Regex conversion escapes
Post by: AlexTP on May 22, 2022, 08:13:00 pm
Quote
>>Does anybody know whether it is possible to use a conversion escape to change the case of a sequence of characters to lowercase?

It is documented in the wiki,
https://wiki.freepascal.org/RegEx_packages#Change_case_on_replaces
Title: Re: Regex conversion escapes
Post by: AlexTP on May 22, 2022, 08:19:24 pm
And in the official docs,
https://regex.sorokin.engineer/en/latest/tregexpr.html#substitute
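
Where the transformation is fixed in code rather than supplied by the user as a rule, the case-change escapes can be avoided altogether by assembling the result from the captured groups. The following is only a sketch under that assumption: LowerHexLiteral is a hypothetical helper name, and it relies on the Match[] property of TRegExpr plus LowerCase from SysUtils.

```pascal
program LowerHex;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

// Hypothetical helper: lowercases the hex digits after the first '0x'
// and returns every other part of the line unchanged.
function LowerHexLiteral(const Line: string): string;
var
  Regex: TRegExpr;
begin
  Result := Line;  // lines without a 0x literal pass through untouched
  Regex := TRegExpr.Create;
  try
    Regex.Expression := '^(.+?)0x([0-9A-Fa-f]+)(.*)$';
    if Regex.Exec(Line) then
      // Only group 2 is converted, so no end-of-conversion marker is needed.
      Result := Regex.Match[1] + '0x' + LowerCase(Regex.Match[2]) + Regex.Match[3];
  finally
    Regex.Free;
  end;
end;

begin
  WriteLn(LowerHexLiteral('Test1: 0x0123456789ABCDEF SomeThing'));
  // prints: Test1: 0x0123456789abcdef SomeThing
end.
```

This gives up the flexibility of user-defined replacement templates, so it only fits cases where the output format is decided by the program itself.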