Author Topic: Regex conversion escapes  (Read 774 times)

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Regex conversion escapes
« on: May 22, 2022, 05:44:14 pm »
Considering the RegExpr library that comes with FPC 3.2.2:

Does anybody know whether it is possible to use a conversion escape to change the case of a sequence of characters to lowercase? Using this:

Code:
Pattern:

^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$

Replacement:

${1}0x\L$2\E$3

I find the first digit of $2 converted, after which E is output literally with the remainder unchanged.

This is for the final stages of some user-defined text transformations (decoding captured data from a logic analyser... if it can disassemble x86 it can do damn well anything), so I would like it to be possible to specify that e.g. hex output is in the appropriate form.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: Regex conversion escapes
« Reply #1 on: May 22, 2022, 05:53:20 pm »
Hello,
do you have an example with data: input and expected output?
Friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

MarkMLl
Re: Regex conversion escapes
« Reply #2 on: May 22, 2022, 06:35:25 pm »
It appears to be mostly working but erratic:

Code: Pascal
program testRegex;

uses Regexpr;

const
  pattern= '^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$';
  substitution= '${1}0x\L$2\E$3';
  test1= 'Test1: 0x0123456789ABCDEF something';
  test2= 'Test2: 0xFEDCBA9876543210 something';

var
  regex: TRegExpr;

begin
  regex := TRegExpr.Create;
  regex.Expression := pattern;
  if regex.Exec(test1) then
    WriteLn(regex.Substitute(substitution))
  else
    WriteLn('Failed');
  if regex.Exec(test2) then
    WriteLn(regex.Substitute(substitution))
  else
    WriteLn('Failed');
  regex.Free
end.

What I'm getting as a result is

Code:
Test1: 0x0123456789abcdefE something
Test2: 0xfedcba9876543210E something

i.e. it's ignoring the \E. That's slightly different from what I (think I) saw earlier, where only a single character was being converted... I've just tinkered with a few more patterns and data and can't duplicate that.

MarkMLl

Jurassic Pork
Re: Regex conversion escapes
« Reply #3 on: May 22, 2022, 07:30:15 pm »
\E does not seem to be necessary to stop the case change when substituting a group.
With this:
Code: Pascal
const
  pattern= '^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$';
  substitution= '${1}0x\L$2$3';
  test1= 'Test1: 0x0123456789ABCDEF SomeThing';
  test2= 'Test2: 0xFEDCBA9876543210 SomeThing';

I get this:
Quote
Test1: 0x0123456789abcdef SomeThing
Test2: 0xfedcba9876543210 SomeThing
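
For anyone wanting to reproduce the comparison, here is a minimal sketch of a complete program that applies both templates to the same line. The program name and the helper Show are made up for illustration; the pattern, templates and test string are the ones quoted in this thread. Going by the results posted so far, the first template should leave a literal E in the output and the second should not.

```pascal
program CompareTemplates;

{$mode objfpc}{$H+}

uses
  RegExpr;

const
  Pattern = '^(.+?)0x([0-9ABCDEFabcdef]+)(.*)$';
  WithE = '${1}0x\L$2\E$3';    // template from the original question
  WithoutE = '${1}0x\L$2$3';   // template without the \E
  Test1 = 'Test1: 0x0123456789ABCDEF SomeThing';

// Hypothetical helper: apply one template to one line and print the result.
procedure Show(const Template, Line: string);
var
  Regex: TRegExpr;
begin
  Regex := TRegExpr.Create;
  try
    Regex.Expression := Pattern;
    if Regex.Exec(Line) then
      WriteLn(Template, ' -> ', Regex.Substitute(Template))
    else
      WriteLn('Failed');
  finally
    Regex.Free;  // released even if Exec or Substitute raises
  end;
end;

begin
  Show(WithE, Test1);
  Show(WithoutE, Test1);
end.
```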

MarkMLl
Re: Regex conversion escapes
« Reply #4 on: May 22, 2022, 07:52:11 pm »
I see what's happening. First, I thought I'd read through the page fairly carefully but now I see https://wiki.lazarus.freepascal.org/RegEx_packages#Change_case_on_replaces and there's no explicit \E as an "end of conversion" marker. It's also got the capability of messing up \n etc., which is unfortunate since I've gone to some trouble to allow the user to tune that to exactly what he wants.

Second, it's defaulting to "non-greedy" which I think is a comparatively recent change (and isn't what Perl etc. does).

Thanks for looking, I think everything makes sense now.

Very slightly later: it's possible to simulate the missing \E provided that the first non-match is non-alpha:

Code:
Without conversion:

-- --- CS FFFF0 B8  [Opcode 0xB8]
-- --- CS FFFF1 FF  [Immediate low 0x00FF]
-- --- CS FFFF2 FF  [Immediate 0xFFFF]
-- ---    FFFF0         MOV AX,FFFFH
-- --- CS FFFF3 8E  [Opcode 0x8E]
-- --- CS FFFF4 D8  [mod-reg-r/m 0xD8]
-- ---    FFFF3         MOV DS,AX

Using this Perl-style rule:

/^(.+?)0x([0-9A-Fa-f]+)(.*)$/${1}0x\L$2\u$3/g

-- --- CS FFFF0 B8  [Opcode 0xb8]
-- --- CS FFFF1 FF  [Immediate low 0x00ff]
-- --- CS FFFF2 FF  [Immediate 0xffff]
-- ---    FFFF0         MOV AX,FFFFH
-- --- CS FFFF3 8E  [Opcode 0x8e]
-- --- CS FFFF4 D8  [mod-reg-r/m 0xd8]
-- ---    FFFF3         MOV DS,AX

All of the output there is rule-generated, with the final detail of formatting handled by that regex.

I think that's good enough :-)
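
Where the transformation can be hard-coded rather than supplied by the user, another option is to skip the template escapes altogether and assemble the result from the captured groups. The following is only a sketch under that assumption: LowerHex is a hypothetical helper name, and it relies on nothing beyond TRegExpr's Exec and Match[] plus LowerCase from SysUtils.

```pascal
program LowerHexSketch;

{$mode objfpc}{$H+}

uses
  SysUtils, RegExpr;

const
  Pattern = '^(.+?)0x([0-9A-Fa-f]+)(.*)$';

// Hypothetical helper: lowercases the hex digits that follow the first "0x"
// in Line, and returns Line unchanged when the pattern does not match.
function LowerHex(const Line: string): string;
var
  Regex: TRegExpr;
begin
  Result := Line;
  Regex := TRegExpr.Create;
  try
    Regex.Expression := Pattern;
    if Regex.Exec(Line) then
      // Only group 2 is case-converted; groups 1 and 3 are copied as matched.
      Result := Regex.Match[1] + '0x' + LowerCase(Regex.Match[2]) + Regex.Match[3];
  finally
    Regex.Free;
  end;
end;

begin
  WriteLn(LowerHex('-- --- CS FFFF0 B8  [Opcode 0xB8]'));
  WriteLn(LowerHex('-- ---    FFFF0         MOV AX,FFFFH'));
end.
```

Because no replacement template is involved, escapes such as \n in the surrounding text are never interpreted, and there is no need for an end-of-conversion marker.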

MarkMLl
« Last Edit: May 22, 2022, 08:09:10 pm by MarkMLl »

AlexTP

  • Hero Member
  • *****
  • Posts: 2386
    • UVviewsoft
Re: Regex conversion escapes
« Reply #5 on: May 22, 2022, 08:13:00 pm »
Quote
>>Does anybody know whether it is possible to use a conversion escape to change the case of a sequence of characters to lowercase?

It is documented in the wiki,
https://wiki.freepascal.org/RegEx_packages#Change_case_on_replaces

AlexTP
Re: Regex conversion escapes
« Reply #6 on: May 22, 2022, 08:19:24 pm »