Author Topic: How do you extract all links on a webpage?  (Read 3524 times)

Awesome Programmer

  • Sr. Member
  • ****
  • Posts: 426
  • Programming is FUN only when it works :)
    • Cool Technology
How do you extract all links on a webpage?
« on: February 23, 2017, 08:35:12 pm »
I am currently using synapse to download files from a web server. Now I would like to download a webpage and extract the links it contains. Could someone please give me guidance on how to do this in Lazarus?

wp

  • Hero Member
  • *****
  • Posts: 7552
Re: How do you extract all links on a webpage?
« Reply #1 on: February 23, 2017, 08:44:06 pm »
The attached demo uses the fasthtmlparser (in fpc's packages/chm/src) to extract the "<a href=" tags. The code is very easy, and you certainly can extend it to your needs.
« Last Edit: February 23, 2017, 09:07:12 pm by wp »
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...
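
For readers who do not have the attachment, the approach described above can be sketched roughly as follows. This is not wp's attached demo, only a minimal illustration under stated assumptions: that the `fasthtmlparser` unit exposes a `THTMLParser` class with a `Create(string)` constructor, an `OnFoundTag` event of type `procedure(NoCaseTag, ActualTag: string) of object`, and an `Exec` method. `TLinkCollector` and `CollectLinkTags` are hypothetical names introduced here, and the HTML string is made-up sample input.

```pascal
program ListLinks;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fasthtmlparser;

const
  // Compared against the upper-cased tag, so it is written in upper case.
  SEARCH_TAG = '<A HREF=';

type
  // Hypothetical helper: OnFoundTag needs a method of an object as handler.
  TLinkCollector = class
  private
    FLinks: TStrings;
  public
    constructor Create(ALinks: TStrings);
    procedure FoundTag(NoCaseTag, ActualTag: string);
  end;

constructor TLinkCollector.Create(ALinks: TStrings);
begin
  inherited Create;
  FLinks := ALinks;
end;

procedure TLinkCollector.FoundTag(NoCaseTag, ActualTag: string);
begin
  // UpperCase is applied again so the test does not depend on the
  // assumption that NoCaseTag already arrives in upper case.
  if Pos(SEARCH_TAG, UpperCase(NoCaseTag)) = 1 then
    FLinks.Add(ActualTag);
end;

// Hypothetical helper: appends every tag starting with SEARCH_TAG to Links.
procedure CollectLinkTags(const Html: string; Links: TStrings);
var
  Collector: TLinkCollector;
  Parser: THTMLParser;
begin
  Collector := TLinkCollector.Create(Links);
  Parser := THTMLParser.Create(Html);
  try
    Parser.OnFoundTag := @Collector.FoundTag;
    Parser.Exec;
  finally
    Parser.Free;
    Collector.Free;
  end;
end;

var
  Links: TStringList;
  i: Integer;
begin
  Links := TStringList.Create;
  try
    // Sample input; in a real program this would be the downloaded page.
    CollectLinkTags('<p><a href="http://example.com/">x</a> <b>y</b></p>', Links);
    for i := 0 to Links.Count - 1 do
      WriteLn(Links[i]);
  finally
    Links.Free;
  end;
end.
```

The sketch collects whole tags rather than bare URLs, and it only matches tags in which `href` directly follows `<a `.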

Awesome Programmer

Re: How do you extract all links on a webpage?
« Reply #2 on: February 24, 2017, 04:38:16 pm »
Quote from: wp
The attached demo uses the fasthtmlparser (in fpc's packages/chm/src) to extract the "<a href=" tags. The code is very easy, and you certainly can extend it to your needs.

OMG!!! This is just the demo I was REALLLLLLLLLLY looking for.... I even changed the SEARCH_TAG to see if it will work on other HTML tags. This demo is just too perfect and simple and it WORKS. Thank you so much, wp.

wp

Re: How do you extract all links on a webpage?
« Reply #3 on: February 24, 2017, 05:43:09 pm »
Of course, I hope you know that my code is only a case study that I wrote while learning how to use the fasthtmlparser. In practice, you must consider that the <a> tag can contain other attributes and that the href attribute need not be the first one. Therefore, the more general approach would be to use "<a " as SEARCH_TAG and, if it is found, scan the NoCaseTag parameter for the position of the "href=" phrase.
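
The more general scan described above could look something like the sketch below. It uses plain string handling only, so it can be called from whatever tag handler is in use. `ExtractHref` is a hypothetical helper name, not part of fasthtmlparser, and the sample tags are made up. Known limitations of this sketch: it does not handle whitespace around the `=` sign, and it would be fooled by attribute names that merely end in `href=` (for example `data-href=`) appearing before the real one.

```pascal
program HrefDemo;

{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: returns the value of the first "href=" found in an
// <a ...> tag, or '' if the tag is not an anchor or has no href.
function ExtractHref(const Tag: string): string;
var
  Upper: string;
  p, q: Integer;
  Quote: Char;
begin
  Result := '';
  Upper := UpperCase(Tag);  // case-insensitive search, same character positions
  // Accept only '<A' followed by whitespace, so that <abbr>, <area> etc. are skipped.
  if (Length(Upper) < 3) or (Copy(Upper, 1, 2) <> '<A')
     or not (Upper[3] in [' ', #9, #10, #13]) then
    Exit;
  p := Pos('HREF=', Upper);
  if p = 0 then
    Exit;
  Inc(p, Length('HREF='));  // p now points at the first character of the value
  if p > Length(Tag) then
    Exit;
  if Tag[p] in ['"', ''''] then
  begin
    // Quoted value: read up to the matching quote character.
    Quote := Tag[p];
    Inc(p);
    q := p;
    while (q <= Length(Tag)) and (Tag[q] <> Quote) do
      Inc(q);
  end
  else
  begin
    // Unquoted value: read up to whitespace or the closing '>'.
    q := p;
    while (q <= Length(Tag)) and not (Tag[q] in [' ', #9, #10, #13, '>']) do
      Inc(q);
  end;
  Result := Copy(Tag, p, q - p);
end;

begin
  WriteLn(ExtractHref('<a class="x" href="http://example.com/a">'));
  WriteLn(ExtractHref('<A HREF=page.html>'));
  WriteLn(ExtractHref('<abbr title="no link here">'));
end.
```

The value is copied from the original tag rather than from its upper-cased copy, so the case of the URL is preserved. Inside an OnFoundTag handler that would mean passing ActualTag, since NoCaseTag is the case-normalised form.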
