Topic: Is this a case for a RegEx?

Bart

  • Hero Member
  • *****
  • Posts: 5290
    • Bart en Mariska's Webstek
Is this a case for a RegEx?
« on: February 15, 2020, 05:47:18 pm »
Hi,

Given that the tools we have to use at work are cheap and therefore not very elegant, I have yet another challenge.

Our electronic prescription system is web-based.
Almost every day I have to paste current medication into a Word document.

That doesn't sound complicated: just paste the bullet list as plain text.
No such luck...

Each and every browser puts the contents of the selected text differently on the clipboard (as plain text).

Considering I copied a text like this:
  • calciumcarbonaat/colecalciferol 1 dd 1 tablet van 1,25g/800ie
  • ciprofloxacine 2 dd 1 tablet van 500mg (= 500 mg)
  • koelzalf zo nodig
  • lactulose 1 dd 15 milliliter van 670mg/ml

I want to copy that, and paste it as plain text (no bullets).

Here's what I get from the clipboard depending on the browser used:

IE 11.0
Code:
calciumcarbonaat/colecalciferol •1 dd 1 tablet van 1,25g/800ie







min






























ciprofloxacine •2 dd 1 tablet van 500mg (= 500 mg)






























min





























koelzalf •zo nodig







min






























lactulose •1 dd 15 milliliter van 670mg/ml





























min































macrogol/zouten •1 dd 1 stuk




Notice that IE is the only browser that puts "min" in there.

MS Edge
Code:

calciumcarbonaat/colecalciferol
1 dd 1 tablet van 1,25g/800ie



ciprofloxacine
2 dd 1 tablet van 500mg (= 500 mg)



koelzalf
zo nodig



lactulose
1 dd 15 milliliter van 670mg/ml


FF 72 (Windows)
Code:

calciumcarbonaat/colecalciferol

    1 dd 1 tablet van 1,25g/800ie


ciprofloxacine

    2 dd 1 tablet van 500mg (= 500 mg)


koelzalf

    zo nodig


lactulose

    1 dd 15 milliliter van 670mg/ml


IE 11 on Win2012 server under Citrix (this is what I get at work)
(You'll notice these are different prescriptions in this case)
Code:

calciumcarbonaat/colecalciferol



•1 dd 1 tabl van 1,25g/800ie

























































































































































































clomipramine



•1 dd 3 tabl van 10mg (= 30 mg)












































































































































































































































































































































































Notice the copious amount of whitespace and line endings that IE adds.

Can somebody come up with a RegEx that does what I want?
I would like it to also remove the pointless "(= 500 mg)" at the end of a prescription line.
Currently I use 12 consecutive StringReplace calls (order matters here) and then some StringReplace on each line of the resulting stringlist.

If I could use a RegEx, I could simply store it in a configuration file and alter it whenever the format of the text changes again (2-4 times a year). Currently I have to adapt the code and rebuild my program (and then manage to secretly put the new executable on the network, which I'm of course not allowed to do).
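
The configuration-file idea can be sketched as an ordered list of pattern/replacement pairs that is loaded at run time. This is only an illustration in Python, not the program described above: the JSON layout, the names `RULES_JSON`, `load_rules` and `clean`, and the patterns themselves are assumptions, and the patterns target only the "IE 11.0" sample.

```python
import json
import re

# Hypothetical rule file contents: an ordered JSON list of [pattern, replacement].
# In practice this text would be read from a file; order matters, as noted above.
RULES_JSON = r'''
[
  ["\\r\\n?", "\n"],
  ["(?m)^[ \\t]*min[ \\t]*$", ""],
  ["[ \\t]*\u2022[ \\t]*", " "],
  ["(?m)[ \\t]*\\(=[^)\\n]*\\)[ \\t]*$", ""],
  ["\\n{2,}", "\n"]
]
'''


def load_rules(text):
    """Parse the JSON text into a list of (compiled pattern, replacement)."""
    return [(re.compile(pattern), repl) for pattern, repl in json.loads(text)]


def clean(clipboard_text, rules):
    """Apply every rule in order, then trim leading and trailing whitespace."""
    for pattern, repl in rules:
        clipboard_text = pattern.sub(repl, clipboard_text)
    return clipboard_text.strip()


if __name__ == "__main__":
    sample = (
        "calciumcarbonaat/colecalciferol •1 dd 1 tablet van 1,25g/800ie"
        "\r\n\r\n\r\nmin\r\n\r\n"
        "ciprofloxacine •2 dd 1 tablet van 500mg (= 500 mg)\r\n\r\n"
    )
    print(clean(sample, load_rules(RULES_JSON)))
```

The rules, in order: normalise line endings, drop the stray "min" lines, turn the bullet into a single space, remove a trailing "(= ...)" group, and collapse blank lines. Formats where the name and the dosage sit on separate lines would need extra rules or some line-merging code.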

The bullet signs are #226#128#162 in UTF-8.
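
Those three bytes are the UTF-8 encoding of U+2022 (BULLET). A small Python check, given purely as an illustration (the helper name `strip_bullets` is made up here):

```python
# U+2022 BULLET and its UTF-8 byte values.
BULLET = "\u2022"
utf8_bytes = list(BULLET.encode("utf-8"))
print(utf8_bytes)  # [226, 128, 162]


def strip_bullets(text):
    """Remove every bullet character from the text."""
    return text.replace(BULLET, "")
```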

Needless to say that the provider of this software (FarMedVisie) refuses to offer a view of this actual medication as unformatted text.

Bart

MaxCuriosus

  • Full Member
  • ***
  • Posts: 136
Re: Is this a case for a RegEx?
« Reply #1 on: February 16, 2020, 12:45:25 pm »
Here is my suggestion:

1) Create an empty text file in the download folder or elsewhere.
2) Put the mouse cursor on the web page that has the desired data.
3) Ctrl+A to select all the text data.
4) Ctrl+C to copy the selected data.
5) Paste it into the empty text file.
6) Write a program to parse the file, extract the desired data, format it and write it out.

It works well with Firefox 68 on Debian 9.

MarkMLl

  • Hero Member
  • *****
  • Posts: 6686
Re: Is this a case for a RegEx?
« Reply #2 on: February 16, 2020, 01:56:12 pm »
I don't think I'd rely 100% on a regex for this. I think that what I'd do would be subclass a TStringList, add methods to merge/split lines and do systematic replacement of e.g. mangled quote characters, and add one or more properties encapsulating regexes and a "ForEach" so that selected regexen could be applied to all lines. Possibly also some predicated methods so that e.g. a line merge would be done where the current and successor lines matched certain patterns. That would give you a fairly comprehensive toolkit that you could apply with a bit of glue code.

I've used regexes extensively with Perl glue for text processing, and the biggest headaches came from lines that needed to be merged depending on context, or processed in some specific way depending on a preceding line.
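
A rough sketch of that kind of toolkit, written in Python rather than as a TStringList descendant. The class `LineList`, its method names and the helper `clean_citrix` are invented for this illustration; the merge rule (a line starting with a bullet belongs to the line before it) is an assumption based on the Citrix sample in the first post.

```python
import re

BULLET = "\u2022"


class LineList(list):
    """A list of lines with a few chainable clean-up helpers."""

    def for_each(self, pattern, replacement):
        """Apply one regex substitution to every line."""
        rx = re.compile(pattern)
        self[:] = [rx.sub(replacement, line) for line in self]
        return self

    def drop_blank(self):
        """Trim every line and discard the ones that end up empty."""
        self[:] = [line.strip() for line in self if line.strip()]
        return self

    def merge_where(self, predicate, sep=" "):
        """Append a line to its predecessor when predicate(previous, line) holds."""
        merged = []
        for line in self:
            if merged and predicate(merged[-1], line):
                merged[-1] = merged[-1] + sep + line
            else:
                merged.append(line)
        self[:] = merged
        return self


def clean_citrix(raw):
    """Glue code for the 'IE 11 under Citrix' layout (name line, then bullet line)."""
    lines = LineList(raw.splitlines())
    lines.drop_blank()
    lines.merge_where(lambda prev, cur: cur.startswith(BULLET))
    lines.for_each(r"\s*" + BULLET + r"\s*", " ")  # bullet -> single space
    lines.for_each(r"\s*\(=[^)]*\)\s*$", "")       # drop a trailing "(= ...)"
    return list(lines)
```

The regexes stay small because the context-dependent part (which lines belong together) is handled by the predicate rather than by one large pattern.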

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

Hartmut

  • Hero Member
  • *****
  • Posts: 749
Re: Is this a case for a RegEx?
« Reply #3 on: February 16, 2020, 02:45:11 pm »
Maybe PureText can help you. When you press a configurable hotkey, the contents of the clipboard are pasted into the current application as plain text (all formatting is removed). The documentation says that bullets are removed too. I have used it on Windows for many years and find it very helpful, especially when copying and pasting from web pages. It's easy to install, and you could test how the result looks in your case.

Of course it cannot delete the "(= 500 mg)" at the end of a line. But it could make your remaining task a lot easier.

asdf1337

  • Jr. Member
  • **
  • Posts: 56
Re: Is this a case for a RegEx?
« Reply #4 on: February 16, 2020, 03:24:55 pm »
Hey,
guess that could be solved easily with the TStringHelpers.

1. Assign all text to string
2. Use String.Split to split by line break
3. Now you have 6 items in an array (according to Edge or FF), where index 1+2, 3+4, 5+6 each form one bullet item
4. Check for position of '(=' in strings and only use the part before
5. Use Trim() to remove unnecessary stuff
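
The steps above can be sketched as follows, in Python rather than with the Pascal string helpers. The function name `pair_lines` is made up, and two assumptions are added that the steps do not state: blank lines are skipped first, and the remaining lines strictly alternate between name and dosage (as in the Edge and FF samples).

```python
def pair_lines(text):
    """Turn alternating name/dosage lines into one string per prescription."""
    # Steps 1-2: split on line breaks; blank lines are skipped (an added assumption).
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    items = []
    # Step 3: even positions are names, odd positions are dosages.
    for name, dosage in zip(lines[0::2], lines[1::2]):
        # Step 4: keep only the part before "(=", if it occurs.
        cut = dosage.find("(=")
        if cut != -1:
            dosage = dosage[:cut]
        # Step 5: trim whatever whitespace is left over.
        items.append(name + " " + dosage.strip())
    return items
```

If a browser emits the name and dosage on one line, or adds extra lines such as "min", the pairing goes out of step.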

Zath

  • Sr. Member
  • ****
  • Posts: 391
Re: Is this a case for a RegEx?
« Reply #5 on: February 16, 2020, 05:36:02 pm »
The reason each and every browser does it differently is the CSS presets each browser employs.

HTML 5 and CSS 3 have helped a lot recently but there are still presets around.

https://cssreset.com/scripts/html5-doctor-css-reset-stylesheet/

If you created your own page with the zeroed presets included with a stylesheet then copied that, you'd get a far more standard output.
Not sure about bullet lists though. I hate them full stop.

Bart

Re: Is this a case for a RegEx?
« Reply #6 on: February 16, 2020, 05:42:22 pm »
Quote from: MaxCuriosus
Here is my suggestion:

1) Create an empty text file in the download folder or elsewhere.
...

How is that different from querying the ClipBoard itself?
The idiotic text in my first post is the result of copy/paste into Notepad.

Bart

Bart

Re: Is this a case for a RegEx?
« Reply #7 on: February 16, 2020, 05:46:38 pm »
Quote from: Zath
If you created your own page with the zeroed presets included with a stylesheet then copied that, you'd get a far more standard output.

Well, I can't.
The text is on the page of a third party.
Nothing I can do to change that.
I've filed several feature requests to alter the format of the text: they just are not willing to do that.

Bart

Bart

Re: Is this a case for a RegEx?
« Reply #8 on: February 16, 2020, 05:49:26 pm »
Quote from: Hartmut
Maybe PureText can help you.

Quote from: PureText
PureText is equivalent to opening Notepad, doing a PASTE, followed by a SELECT-ALL, and then a COPY.

This is equivalent to ClipBoard.GetAsText.
Which produces the text in my first post.

Bart

Bart

Re: Is this a case for a RegEx?
« Reply #9 on: February 16, 2020, 05:50:54 pm »
Quote from: asdf1337
3. Now you have 6 items in an array (according to Edge or FF), where index 1+2, 3+4, 5+6 each form one bullet item

As you can see, that depends on browser settings (and on the way the web page is constructed, which changes every so often).

Bart

Bart

Re: Is this a case for a RegEx?
« Reply #10 on: February 16, 2020, 05:53:18 pm »
Quote from: MarkMLl
I don't think I'd rely 100% on a regex for this.

Seems like it is not.
It would have been so nice, especially the fact that it would be configurable.

@All: thanks for your input.
I'll keep fighting that damned web-based electronic prescription system.

Bart

winni

  • Hero Member
  • *****
  • Posts: 3197
Re: Is this a case for a RegEx?
« Reply #11 on: February 16, 2020, 06:04:23 pm »
Hi!

Build a Lazarus Mini-App:

One Memo with wordwrap disabled
Button1:Copy from Clipboard to Memo
Button2: Delete Empty lines and leading blanks
Button3: Copy Memo To Clipboard
Button4: Clear Memo

How about that?
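
The Button2 step on its own could look like this; a Python sketch for illustration only, with a made-up function name:

```python
def delete_empty_lines_and_leading_blanks(text):
    """Drop blank lines and remove leading whitespace from the remaining lines."""
    kept = [line.lstrip() for line in text.splitlines() if line.strip()]
    return "\n".join(kept)
```

Applied to the FF 72 sample, this leaves the name and the dosage on separate lines, so pairing them up would still be a separate step.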

Winni

Bart

Re: Is this a case for a RegEx?
« Reply #12 on: February 16, 2020, 08:43:58 pm »
Quote from: winni
How about that?

That won't do the job.
And it can be done in memory (basically that's how I start cleaning it all up).

Bart

MaxCuriosus

Re: Is this a case for a RegEx?
« Reply #13 on: February 16, 2020, 10:06:58 pm »
How about saving the entire page and parsing the HTML file, ignoring the corresponding folder if any? It's more laborious, but it can be automated.

By the way, is the bullet really a character or a micro picture?

Have you analyzed the HTML page to see what kind of special characters are embedded in or around the prescription strings?
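
Parsing a saved page could be sketched like this with Python's standard `html.parser` module. The markup of the real prescription page is not known here, so the `<ul>`/`<li>` structure in the example, the class `ListItemText` and the function `list_items` are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the text of each top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.items = []   # one string per list item
        self._depth = 0   # greater than 0 while inside an <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1
            if self._depth == 1:
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.items[-1] += data


def list_items(html):
    """Return the text of every list item with whitespace runs collapsed."""
    parser = ListItemText()
    parser.feed(html)
    parser.close()
    return [" ".join(item.split()) for item in parser.items]


if __name__ == "__main__":
    # Hypothetical markup; the real page will differ.
    page = (
        "<ul><li><b>koelzalf</b>\n   <span>zo nodig</span></li>"
        "<li>lactulose 1 dd 15 milliliter</li></ul>"
    )
    print(list_items(page))
```

Working from the HTML sidesteps the browser-specific clipboard layouts, at the cost of depending on the page's markup instead.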

argb32

  • Jr. Member
  • **
  • Posts: 89
    • Pascal IDE based on IntelliJ platform
Re: Is this a case for a RegEx?
« Reply #14 on: February 16, 2020, 10:35:32 pm »
Regular expressions for these cases are quite simple. For example, the first text (IE 11.0) can be formatted with a find/replace using the regex:
\n{2,}(min\n+)?
with the replacement string #13#10
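
A quick way to try that pattern, sketched in Python. It assumes the text uses bare "\n" line endings and that the blank lines are truly empty (no spaces or tabs); the function name `collapse` and the shortened sample are made up for the test.

```python
import re

# A run of two or more newlines, optionally followed by "min" and more newlines.
PATTERN = re.compile(r"\n{2,}(min\n+)?")


def collapse(text):
    """Replace each matched run with a single CR LF (#13#10)."""
    return PATTERN.sub("\r\n", text)


if __name__ == "__main__":
    sample = (
        "calciumcarbonaat/colecalciferol •1 dd 1 tablet van 1,25g/800ie"
        + "\n" * 8 + "min" + "\n" * 31
        + "ciprofloxacine •2 dd 1 tablet van 500mg (= 500 mg)"
    )
    print(collapse(sample).split("\r\n"))
```

The bullet and the "(= 500 mg)" suffix are left in place; they would need their own replacements.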
