Author Topic: Extract URL list from string  (Read 3303 times)

xinyiman

  • Hero Member
  • *****
  • Posts: 2261
    • Lazarus and Free Pascal italian community
Extract URL list from string
« on: February 10, 2015, 12:11:12 pm »
Hello guys, can anyone tell me how to extract the list of all URLs contained in a string? Thank you.
Win10, Ubuntu and Mac
Lazarus: 2.1.0
FPC: 3.3.1

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: Extract URL list from string
« Reply #1 on: February 10, 2015, 03:19:53 pm »
Well, you can parse the string and look for the standard URL format, taking each URL from "http://" up to the next space character.
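
A minimal sketch of that scan, assuming only the "http://" prefix matters and that a space (or the end of the string) terminates a URL. The name ExtractByScan and the sample text are made up for illustration.

```pascal
program ScanForUrls;

{$mode objfpc}{$H+}

uses
  Classes, StrUtils;

const
  UrlPrefix = 'http://';

// Hypothetical helper: adds every substring that starts with UrlPrefix and
// runs up to the next space (or the end of Source) to Found.
procedure ExtractByScan(const Source: string; Found: TStrings);
var
  StartPos, EndPos: Integer;
begin
  StartPos := PosEx(UrlPrefix, Source, 1);
  while StartPos > 0 do
  begin
    EndPos := PosEx(' ', Source, StartPos);
    if EndPos = 0 then
      EndPos := Length(Source) + 1;  // the URL runs to the end of the string
    Found.Add(Copy(Source, StartPos, EndPos - StartPos));
    // Continue searching after the URL just collected
    StartPos := PosEx(UrlPrefix, Source, EndPos);
  end;
end;

var
  Urls: TStringList;
  I: Integer;
begin
  Urls := TStringList.Create;
  try
    ExtractByScan('see http://a.example/x and http://b.example now', Urls);
    for I := 0 to Urls.Count - 1 do
      WriteLn(Urls[I]);
  finally
    Urls.Free;
  end;
end.
```

Trailing punctuation (a comma or full stop right after the URL) would be kept as part of the match, so real text probably needs some trimming afterwards.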

You could also use a TStringList with #32 as the delimiter and then check each item for "http://" at the first position.
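
A sketch of the TStringList variant, under the same assumptions (only "http://", words separated by single spaces). ExtractBySplit is a hypothetical name.

```pascal
program SplitForUrls;

{$mode objfpc}{$H+}

uses
  Classes;

// Hypothetical helper: splits Source on spaces and keeps the items that
// begin with 'http://'.
procedure ExtractBySplit(const Source: string; Found: TStrings);
var
  Words: TStringList;
  I: Integer;
begin
  Words := TStringList.Create;
  try
    Words.Delimiter := #32;
    Words.StrictDelimiter := True;  // split on the delimiter only
    Words.DelimitedText := Source;
    for I := 0 to Words.Count - 1 do
      if Pos('http://', Words[I]) = 1 then
        Found.Add(Words[I]);
  finally
    Words.Free;
  end;
end;

var
  Urls: TStringList;
  I: Integer;
begin
  Urls := TStringList.Create;
  try
    ExtractBySplit('see http://a.example/x and http://b.example now', Urls);
    for I := 0 to Urls.Count - 1 do
      WriteLn(Urls[I]);
  finally
    Urls.Free;
  end;
end.
```

Note that DelimitedText still honours QuoteChar, so input containing double quotes may be split differently than expected, and tabs or line breaks are not treated as separators here.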
« Last Edit: February 10, 2015, 03:23:18 pm by typo »

Leledumbo

  • Hero Member
  • *****
  • Posts: 8835
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Extract URL list from string
« Reply #2 on: February 10, 2015, 05:13:39 pm »
Diego Perini has written a thoroughly tested regex for matching URLs, so you can make good use of it. I haven't tested it with our regexpr unit, though.

EDIT:
It needs some modifications because of its negative lookaheads and non-capturing subpatterns (simply remove every ?! and ?: after an opening parenthesis). Even after that, you must still modify the regexpr unit to increase the NSUBEXP constant: 15 is not enough for this regex (in my test 25 was not enough and 35 was, so the required value is somewhere in between). One last thing: activate {$define unicode}, or change all occurrences of \x{xxxx} to \x{xx} if you don't want Unicode support.
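
For the mechanics of collecting every match with the regexpr unit, here is a minimal sketch. The pattern is a deliberately simplified one chosen for illustration (scheme, "://", then non-space characters), not the regex discussed above, and ExtractByRegex is a hypothetical name.

```pascal
program RegexForUrls;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

// Hypothetical helper: adds every match of a simplified URL pattern to Found.
procedure ExtractByRegex(const Source: string; Found: TStrings);
var
  Re: TRegExpr;
begin
  Re := TRegExpr.Create;
  try
    // Simplified pattern (an assumption): http, https or ftp, then "://",
    // then everything up to the next whitespace character.
    Re.Expression := '(https?|ftp)://\S+';
    if Re.Exec(Source) then
      repeat
        Found.Add(Re.Match[0]);  // Match[0] is the whole matched text
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

var
  Urls: TStringList;
  I: Integer;
begin
  Urls := TStringList.Create;
  try
    ExtractByRegex('see http://a.example/x and ftp://b.example now', Urls);
    for I := 0 to Urls.Count - 1 do
      WriteLn(Urls[I]);
  finally
    Urls.Free;
  end;
end.
```

This pattern uses a single capturing group, so it stays within the default subexpression limit; swapping in a stricter expression only means changing Re.Expression.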

Attached is a test program using the test data from the link above. Unfortunately, it does not return true for every case in the "expected to be true" part.
« Last Edit: February 10, 2015, 08:22:12 pm by Leledumbo »

JZS

  • Full Member
  • ***
  • Posts: 205
Re: Extract URL list from string
« Reply #3 on: February 10, 2015, 05:25:02 pm »
It depends on which variations of URLs you expect your string to contain.
The following are all commonly treated as URLs. Is it possible that any of them appear in your string?

www.website.com
website.com
website.it
https://www.website.com
https://website.subwebsite.org
http://www.website.com
ftp://...

Also, a URL might contain an IP address, a port, or a path:
domain:port
bit.ly/1abcde7
123.123.123.123

Read this link to understand what I mean:
http://en.wikipedia.org/wiki/Uniform_resource_locator
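
To illustrate why the choice of variations matters, here is a rough sketch that accepts a word as a URL candidate if it starts with one of a few known prefixes. The prefix list and the name LooksLikeUrl are assumptions made for this example only.

```pascal
program ClassifyTokens;

{$mode objfpc}{$H+}

uses
  StrUtils;

const
  // Assumed prefix list; extend it to cover the variations you expect.
  Prefixes: array[0..3] of string = ('http://', 'https://', 'ftp://', 'www.');

// Hypothetical helper: True if Token starts with one of Prefixes,
// compared case-insensitively.
function LooksLikeUrl(const Token: string): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := Low(Prefixes) to High(Prefixes) do
    if AnsiStartsText(Prefixes[I], Token) then
      Exit(True);
end;

begin
  WriteLn(LooksLikeUrl('https://www.website.com'));
  WriteLn(LooksLikeUrl('www.website.com'));
  WriteLn(LooksLikeUrl('website.com'));
end.
```

Bare forms such as website.com, bit.ly/1abcde7 or 123.123.123.123 are not recognised by a prefix check like this one; catching them would need extra rules, for example a list of known top-level domains.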
I use recent stable release

 
