Lazarus
Programming => Networking and Web Programming => Topic started by: franzala on September 05, 2019, 06:54:58 pm
-
I frequently look at a classified-ads site which returns several pages of results for my requests.
I have written a little application to extract the data of the ads from the source code of one page, but I have to copy each page manually to obtain the whole information, which is boring and time consuming.
I now want to automate the copying of all the pages answering my request in one shot, in order to avoid that burden.
Could somebody give me a hint on how to do that?
Thanks in advance
-
That is a pretty broad topic, given that you didn't provide any details about how you access the site, or what tasks you perform manually that you want to automate.
-
I have written a little application to extract the data of the ads from the source code of one page, but I have to copy each page manually to obtain the whole information, which is boring and time consuming.
If you've already written the application, I don't quite see how it can be "boring and time consuming" for the application to extract the data from more than one page.
Since you do not provide the information, I speculate that all your application has to do extra is follow the link from one page of results to the next, which sounds pretty trivial. Perhaps you need to elaborate on what the issue is?
-
Hi trev,
Thanks for your answer. Where can I find the link from one page to the next in the source code (or which word or expression should I search for)? The source code of one page of the site usually amounts to about 150 pages of text.
Sorry to ask for such trivial information, but I'm not familiar with that stuff.
-
Where can I find the link from one page to the next in the source code (or which word or expression should I search for)?
We can't answer that without seeing the actual page. Do you see such a link when viewing the page in a web browser? Are the pages numbered in their URLs?
-
At the bottom of the page there is a list of the pages; I have copied the link to page 3 as an example:
https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max&page=3
It is easy to build the other links from this one by changing only the page number; my question now is: how can I obtain one or several files with the source code of these pages (1 to 8 in this case)?
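Building those links can be sketched in Free Pascal with plain string concatenation. A minimal sketch, assuming the base search URL does not yet contain a page parameter (the short Base constant below is a made-up stand-in for the real search URL):

```pascal
program BuildPageUrls;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Append the page number to a search URL that does not
// yet contain a "page" parameter.
function BuildPageUrl(const BaseUrl: string; Page: Integer): string;
begin
  Result := BaseUrl + '&page=' + IntToStr(Page);
end;

const
  // Shortened stand-in for the real search URL (hypothetical).
  Base = 'https://www.leboncoin.fr/recherche/?category=9&sort=price';
var
  i: Integer;
begin
  for i := 1 to 8 do
    WriteLn(BuildPageUrl(Base, i));
end.
```

This only produces the 8 URLs; actually downloading them is a separate step, discussed further down in the thread.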
-
Can you attach a small sample project of what you've already done so we can better help you?
-
Take a look at this thread.
https://forum.lazarus.freepascal.org/index.php/topic,44814.0.html (https://forum.lazarus.freepascal.org/index.php/topic,44814.0.html)
-
A short description of what I'm doing up to now:
1. I make a search on the ads site which gives an answer spread over, for instance, 8 pages.
2. I manually copy the source code of each of the 8 pages into 8 files; each source code contains several parts, one of which describes each ad included in the page with its characteristics.
3. The application I have written extracts from these 8 files the data that interests me, i.e. some of the characteristics such as the price, the location of the seller..., and saves it in a file which I use to compare the various offers (from the same search and also from previous searches).
What I find boring and time consuming is the manual copying of each page, and I'm looking for a way to copy them automatically starting from the first page, which contains the links to the 7 others. Here is a link to such a page:
https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max
I hope that these short explanations will help to clarify my question.
-
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.
If the web page content is static and does not use JavaScript, you can use TFPHttpClient.
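For the static case, a minimal TFPHttpClient sketch that downloads each results page and saves it to a file might look like this. The URL and file names are placeholders, and the site may still refuse non-browser clients (as noted later in this thread), so treat this as a sketch rather than a working scraper:

```pascal
program FetchPages;
{$mode objfpc}{$H+}
uses
  SysUtils, Classes, fphttpclient, opensslsockets; // opensslsockets enables https

var
  Client: TFPHTTPClient;
  i: Integer;
  Page: string;
begin
  Client := TFPHTTPClient.Create(nil);
  try
    // Some servers refuse requests without a browser-like User-Agent.
    Client.AddHeader('User-Agent', 'Mozilla/5.0');
    for i := 1 to 8 do
    begin
      // Placeholder URL: substitute your real search URL here.
      Page := Client.Get('https://example.com/search?page=' + IntToStr(i));
      with TStringList.Create do
      try
        Text := Page;
        SaveToFile(Format('page%d.html', [i]));
      finally
        Free;
      end;
    end;
  finally
    Client.Free;
  end;
end.
```

The saved page1.html .. page8.html files could then be fed to the extraction application exactly as the manually saved pages are today.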
-
Question: do you have code that automatically downloads a page? That is, you tell your app the first page is at http://...... and YOUR APP downloads it?
Because if I read this correctly, everyone here assumes this.
But you said you copy the page into a file (as in, you do "save as" in your browser), and then your app only opens the file on your hard disk?
In the latter case, you may simply need something like lnet.
-
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.
If the web page content is static and does not use JavaScript, you can use TFPHttpClient.
Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).
The pages are probably dynamic, because there is client-side scripting to define the search (and probably also server-side).
If I understand your post properly, I will have to write JavaScript, which means learning it first.
I'd prefer to do it in Pascal with Lazarus!
-
I'd prefer to do it in Pascal with Lazarus!
It is not going to work in any language, unless you are a really good programmer.
-
Quick look at the page.
It seems the server may actively try to block downloads from outside a browser. At least "wget" failed for me. (But it could have been me not escaping the URL correctly...)
You also need to check whether you are legally allowed to automatically process their data.
- For non-personal use (if you re-publish the results), this is probably illegal => you would need a license to the data.
- For personal use (only you look at the results), this may depend on the country you live in...
Anyway, you will need a very good understanding of web technologies.
The data appears to be in the page (if you can get the page). But you need to parse the format (and the company may change it, in which case you start over). That can be done in Pascal.
In any case (and completely independent of what language you use) => it is NOT trivial.
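Parsing such a format in Pascal can, at its simplest, be done with plain string functions. A hedged sketch: the markers and the sample fragment below are made up, so you would have to find the real markers in the actual page source:

```pascal
program ExtractDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Return the text found between StartMark and EndMark,
// or '' if either marker is missing.
function ExtractBetween(const S, StartMark, EndMark: string): string;
var
  p1, p2: Integer;
begin
  Result := '';
  p1 := Pos(StartMark, S);
  if p1 = 0 then Exit;
  p1 := p1 + Length(StartMark);
  p2 := Pos(EndMark, Copy(S, p1, MaxInt));
  if p2 = 0 then Exit;
  Result := Copy(S, p1, p2 - 1);
end;

const
  // A made-up fragment standing in for the real page source.
  Sample = '{"price":"250000","city":"Chessy"}';
begin
  WriteLn('price = ', ExtractBetween(Sample, '"price":"', '"'));
  WriteLn('city = ', ExtractBetween(Sample, '"city":"', '"'));
end.
```

The same helper, called repeatedly while advancing through the page text, is enough for simple extraction; a real HTML/JSON parser is more robust when the site changes its markup.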
-
@franzala
As you have been told, what you want to achieve isn't simple. But if you're already good at string manipulation and data storage in Pascal, you may be interested in trying a very simple demo posted by @wp:
https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199 (https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199)
I followed the demo, and later created my web forum spam detection program. But because of the limits of my knowledge, it is only half-finished.
First, using the demo I wrote a very simple web browser. It ran extremely slowly, so I added my own image caching routine. Much better, but unfortunately I can't render the CSS. By analyzing the website I wanted to 'scrape', I was able to make the program follow the links of the pages automatically. But because the website has an enormous number of pages, I had to define the range for scraping on each run. And I had to slow it down by adding a delay for each page, because some websites or their web hosting's firewall will block incoming users (bots) that try to request too many pages within a certain time.
The fetched data is stored in a dbf file; I chose dbf because it needs zero installation and it is capable of what I want to do. On each run, the program downloads only a certain range of pages, updating pages that were already stored. Then I wrote some code to find all the outbound links in the stored data, compare them with my defined blacklist and whitelist keywords, and show a report of pages that contain suspicious spam links.
It sounds simple, just 2 paragraphs to explain it. But it actually took me months to write, and it became too hard for me, so I'm taking a break.
:-X I've been writing a program for spam detection on this forum.
I do not have a license to the data. But it's just for personal use, so I think it is okay.
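The dbf storage mentioned above can be sketched with FPC's TDbf component (from the fcl-db package). A minimal sketch; the table and field names are made up for the example:

```pascal
program DbfDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, db, dbf;

var
  Table: TDbf;
begin
  Table := TDbf.Create(nil);
  try
    Table.FilePathFull := GetTempDir;   // write the table to the temp dir
    Table.TableName := 'ads.dbf';
    Table.FieldDefs.Add('PRICE', ftInteger);
    Table.FieldDefs.Add('CITY', ftString, 40);
    Table.CreateTable;
    Table.Open;
    Table.Append;                       // add one sample record
    Table.FieldByName('PRICE').AsInteger := 250000;
    Table.FieldByName('CITY').AsString := 'Chessy';
    Table.Post;
    WriteLn('records: ', Table.RecordCount);
    Table.Close;
  finally
    Table.Free;
  end;
end.
```

Because the dbf file is just a single file on disk, nothing has to be installed on the machine, which is the "zero installation" advantage mentioned above.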
-
Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).
The pages are probably dynamic, because there is client-side scripting to define the search (and probably also server-side).
If I understand your post properly, I will have to write JavaScript, which means learning it first.
I'd prefer to do it in Pascal with Lazarus!
Using Pascal will not be trivial, although it is possible.
Running Chrome as a headless browser, you can use --dump-dom to get the DOM of a web page:
chrome --headless --disable-gpu --dump-dom https://www.your-own-app.com/search?bla=blabla&....
It will output the DOM to STDOUT; you can use TProcess to run it and capture the output, or pipe the above command into your own application:
chrome --headless --disable-gpu --dump-dom https://www.your-own-app.com/search?bla=blabla&.... | yourownapp
For the latter, you can read Chrome's output from STDIN and then process the DOM using regex to extract the parts you are interested in. This may or may not be what you are looking for.
For advanced use cases (e.g. sending commands to simulate mouse clicks programmatically), you may want to take a look at the Chrome DevTools Protocol. It allows an external application to control a headless Chrome browser.
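The TProcess route can be sketched with FPC's RunCommand wrapper (unit Process). In this sketch /bin/echo stands in for the chrome binary so it runs anywhere; for real use you would substitute the chrome executable and the --headless --disable-gpu --dump-dom flags shown above, plus your search URL:

```pascal
program DumpDom;
{$mode objfpc}{$H+}
uses
  SysUtils, Process;

var
  Output: string;
begin
  // For real use, something like (path and URL are placeholders):
  //   RunCommand('chrome', ['--headless', '--disable-gpu',
  //     '--dump-dom', 'https://example.com/search'], Output);
  if RunCommand('/bin/echo', ['<html>stand-in DOM</html>'], Output) then
    WriteLn('captured: ', Trim(Output))
  else
    WriteLn('command failed');
end.
```

RunCommand returns False if the process could not be started or exited with an error, so the failure branch is worth keeping when the browser binary may be missing.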
-
Many thanks for your hints; I will try to use Chrome as a headless browser and look at the Chrome DevTools Protocol to see whether I'm able to understand it or not.