Author Topic: How to automate copying the source code of several consecutive pages on the net  (Read 4065 times)

zamronypj

  • Full Member
  • ***
  • Posts: 133
    • Fano Framework, Free Pascal web application framework

Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).

The pages are probably dynamic because there is client-side scripting to define the search (and probably also server-side).

If I understand your post properly, I will have to write JavaScript, which means I first have to learn how to do that.

I'd prefer to do it in Pascal with Lazarus!

Doing this in Pascal is possible, although it will not be trivial.

If you run Chrome as a headless browser, you can use --dump-dom to get the DOM of a web page:

Code: [Select]
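chrome --headless --disable-gpu --dump-dom "https://www.your-own-app.com/search?bla=blabla&...."
It writes the DOM to STDOUT. You can either run it with TProcess and capture the output, or pipe the command into your own application. Note the quotes around the URL: without them, a shell treats the & as "run in the background" and cuts the URL short.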
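
For the TProcess route, a minimal sketch could look like the following. It uses RunCommand from the Process unit, which is a thin wrapper around TProcess. The executable name and the URL are assumptions for illustration: the Chrome binary may be called something else on your system, and the URL is only a placeholder.

```pascal
program DumpDom;

{$mode objfpc}{$H+}

uses
  Process;

const
  // Assumption: 'chrome' is on the PATH. On many systems the binary is
  // named 'google-chrome', 'chromium' or 'chrome.exe' instead.
  ChromeExe = 'chrome';
  // Placeholder URL, replace it with the page you want to fetch.
  PageUrl = 'https://www.your-own-app.com/search?bla=blabla';

var
  Dom: string;
begin
  // RunCommand starts the process, waits for it to finish and returns
  // everything it wrote to STDOUT in Dom. No shell is involved, so the
  // URL needs no quoting here.
  if RunCommand(ChromeExe,
                ['--headless', '--disable-gpu', '--dump-dom', PageUrl],
                Dom) then
    WriteLn(Dom)
  else
  begin
    WriteLn(StdErr, 'Could not run ', ChromeExe);
    Halt(1);
  end;
end.
```

For several consecutive pages you would call RunCommand in a loop, once per URL, and save or process Dom each time.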

Code: [Select]
chrome --headless --disable-gpu --dump-dom "https://www.your-own-app.com/search?bla=blabla&...." | yourownapp
With the pipe approach, your application reads Chrome's output from STDIN, and you can then process the DOM with a regex to extract the parts you are interested in. This may or may not be what you are looking for.
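
As a rough sketch of the receiving side of that pipe, the program below reads everything from STDIN and prints the href of every link it finds, using TRegExpr from the RegExpr unit that ships with Free Pascal. Extracting links is only an example, and ExtractLinks is a hypothetical helper name. The pattern is deliberately simple (double-quoted href values only); a regex is not a real HTML parser.

```pascal
program yourownapp;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, RegExpr;

// Returns the href value of every <a ...> tag found in Html.
// Only double-quoted href attributes are matched.
function ExtractLinks(const Html: string): TStringList;
var
  Re: TRegExpr;
begin
  Result := TStringList.Create;
  Re := TRegExpr.Create;
  try
    Re.ModifierI := True;  // case-insensitive, so <A HREF=...> also matches
    Re.Expression := '<a\s[^>]*href="([^"]*)"';
    if Re.Exec(Html) then
      repeat
        Result.Add(Re.Match[1]);  // Match[1] is the first capture group
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;

var
  Line, Html: string;
  Links: TStringList;
  i: Integer;
begin
  // Collect the whole DOM that Chrome wrote to the pipe.
  Html := '';
  while not EOF(Input) do
  begin
    ReadLn(Line);
    Html := Html + Line + LineEnding;
  end;

  Links := ExtractLinks(Html);
  try
    for i := 0 to Links.Count - 1 do
      WriteLn(Links[i]);
  finally
    Links.Free;
  end;
end.
```

Change the expression to match whatever part of the page you need; the read loop stays the same.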

For advanced use cases (e.g. sending commands to simulate a mouse click programmatically), you may want to take a look at the Chrome DevTools Protocol. It allows an external application to control a headless Chrome browser.
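
To give an idea of the first step only: if Chrome is started with --remote-debugging-port, it exposes a small HTTP endpoint that reports the WebSocket URL over which the actual protocol commands are sent. The sketch below just queries that endpoint with fphttpclient. The port 9222 is an assumption (any free port works), and the WebSocket part, which is where clicks and navigation would be sent, is not shown.

```pascal
program CdpVersion;

{$mode objfpc}{$H+}

uses
  SysUtils, fphttpclient, fpjson, jsonparser;

const
  // Assumption: Chrome was started beforehand with
  //   chrome --headless --disable-gpu --remote-debugging-port=9222
  Endpoint = 'http://127.0.0.1:9222/json/version';

var
  Data: TJSONData;
begin
  try
    // The endpoint answers with a JSON object describing the browser.
    Data := GetJSON(TFPHTTPClient.SimpleGet(Endpoint));
    try
      WriteLn('Browser:   ', TJSONObject(Data).Strings['Browser']);
      WriteLn('WebSocket: ', TJSONObject(Data).Strings['webSocketDebuggerUrl']);
    finally
      Data.Free;
    end;
  except
    on E: Exception do
    begin
      // Typically raised when Chrome is not running or uses another port.
      WriteLn(StdErr, 'Request failed: ', E.Message);
      Halt(1);
    end;
  end;
end.
```

From there you would need a WebSocket client to talk to the printed URL, which is the non-trivial part in Pascal.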

« Last Edit: September 13, 2019, 01:02:55 am by zamronypj »
Fano Framework, Free Pascal web application framework https://fanoframework.github.io
Apache module executes Pascal program like scripting language https://zamronypj.github.io/mod_pascal/
Github https://github.com/zamronypj

franzala

  • New Member
  • *
  • Posts: 39

Many thanks for your hints. I will try to use Chrome as a headless browser and look at the Chrome DevTools Protocol to see whether I am able to understand it.

 
