Lazarus
Programming => Networking and Web Programming => Topic started by: franzala on September 05, 2019, 06:54:58 pm
-
I frequently look at a classified-ads site which returns several pages of results for my requests.
I have written a little application to extract the data of the ads from the source code of one page, but I have to copy each page manually to obtain the whole information, which is boring and time consuming.
I now want to automate the copying of all the pages answering my request in one shot, in order to avoid that burden.
Could somebody give me a hint on how to do that?
Thanks in advance
-
That is a pretty broad topic, given that you didn't provide any details about how you access the site, or what tasks you perform manually that you want to automate.
-
I have written a little application to extract the data of the ads from the source code of one page, but I have to copy each page manually to obtain the whole information, which is boring and time consuming.
If you've already written the application, I don't quite see how it can be "boring and time consuming" for the application to extract the data from more than one page.
Since you do not provide the information, I speculate that all your application has to do extra is follow the link from one page of results to the next, which sounds pretty trivial. Perhaps you need to elaborate on what the issue is?
-
Hi trev,
Thanks for your answer. Where can I find the link from one page to the next in the source code (or which word or expression should I search for)? The source code of one page of the site usually amounts to about 150 pages of text.
Sorry to ask for such trivial information, but I'm not familiar with that stuff.
-
Where can I find the link from one page to the next in the source code (or which word or expression should I search for)?
We can't answer that without seeing the actual page. Do you see such a link when viewing the page in a web browser? Are the pages numbered in their URLs?
-
At the bottom of the page there is a list of the pages; I have copied the link to page 3 as an example:
https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max&page=3
It is easy to build the other links from this one by changing only the page number; my question now is: how can I obtain one or several files with the source code of these pages (1 to 8 in this case)?
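Building those links can be sketched in Free Pascal with plain string concatenation. A minimal sketch, assuming the base search URL does not yet contain a page parameter (the short Base constant below is a made-up stand-in for the real search URL):

```pascal
program BuildPageUrls;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Append the page number to a search URL that does not
// yet contain a "page" parameter.
function BuildPageUrl(const BaseUrl: string; Page: Integer): string;
begin
  Result := BaseUrl + '&page=' + IntToStr(Page);
end;

const
  // Shortened stand-in for the real search URL (hypothetical).
  Base = 'https://www.leboncoin.fr/recherche/?category=9&sort=price';
var
  i: Integer;
begin
  for i := 1 to 8 do
    WriteLn(BuildPageUrl(Base, i));
end.
```

This only produces the 8 URLs; actually downloading them is a separate step, discussed further down in the thread.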
-
Can you attach a small sample project of what you've already done so we can better help you?
-
Take a look at this thread.
https://forum.lazarus.freepascal.org/index.php/topic,44814.0.html (https://forum.lazarus.freepascal.org/index.php/topic,44814.0.html)
-
A short description of what I'm doing up to now:
1. I make a search on the ads site which gives an answer spread over, for instance, 8 pages.
2. I manually copy the source code of each of the 8 pages into 8 files; each source code contains several parts, one of which describes each ad included in the page with its characteristics.
3. The application I have written extracts from these 8 files the data that interests me, i.e. some of the characteristics such as the price, the location of the seller..., and saves it in a file which I use to compare the various offers (from the same search and also from previous searches).
What I find boring and time consuming is the manual copying of each page, and I'm looking for a way to copy them automatically starting from the first page, which contains the links to the 7 others. Here is a link to such a page:
https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max
I hope that these short explanations will help to clarify my question.
-
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.
If the web page content is static and does not use JavaScript, you can use TFPHttpClient.
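For the static case, a minimal TFPHttpClient sketch that downloads each results page and saves it to a file might look like this. The URL and file names are placeholders, and the site may still refuse non-browser clients (as noted later in this thread), so treat this as a sketch rather than a working scraper:

```pascal
program FetchPages;
{$mode objfpc}{$H+}
uses
  SysUtils, Classes, fphttpclient, opensslsockets; // opensslsockets enables https

var
  Client: TFPHTTPClient;
  i: Integer;
  Page: string;
begin
  Client := TFPHTTPClient.Create(nil);
  try
    // Some servers refuse requests without a browser-like User-Agent.
    Client.AddHeader('User-Agent', 'Mozilla/5.0');
    for i := 1 to 8 do
    begin
      // Placeholder URL: substitute your real search URL here.
      Page := Client.Get('https://example.com/search?page=' + IntToStr(i));
      with TStringList.Create do
      try
        Text := Page;
        SaveToFile(Format('page%d.html', [i]));
      finally
        Free;
      end;
    end;
  finally
    Client.Free;
  end;
end.
```

The saved page1.html .. page8.html files could then be fed to the extraction application exactly as the manually saved pages are today.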
-
Question: do you have code that automatically downloads a page? That is, you tell your app the first page is at http://...... and YOUR APP downloads it?
Because if I read this correctly, everyone here assumes this.
But you said you copy the page into a file (as in, you do "save as" in your browser), and then your app only opens the file on your hard disk?
In the latter case, you may simply need something like lnet.
-
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.
If the web page content is static and does not use JavaScript, you can use TFPHttpClient.
Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).
The pages are probably dynamic, because there is client-side scripting to define the search (and probably also server-side).
If I understand your post properly, I will have to write JavaScript, which means learning it first.
I'd prefer to do it in Pascal with Lazarus!
-
I'd prefer to do it in Pascal with Lazarus!
It is not going to work in any language, unless you are a really good programmer.
-
Quick look at the page.
It seems the server may actively try to block downloads from outside a browser. At least "wget" failed for me. (But it could have been me not escaping the URL correctly...)
You also need to check whether you are legally allowed to automatically process their data.
- For non-personal use (if you re-publish the results), this is probably illegal => you would need a license to the data.
- For personal use (only you look at the results), this may depend on the country you live in...
Anyway, you will need a very good understanding of web technologies.
The data appears to be in the page (if you can get the page). But you need to parse the format (and the company may change it, in which case you start over). That can be done in Pascal.
In any case (and completely independent of what language you use) => it is NOT trivial.
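Parsing such a format in Pascal can, at its simplest, be done with plain string functions. A hedged sketch: the markers and the sample fragment below are made up, so you would have to find the real markers in the actual page source:

```pascal
program ExtractDemo;
{$mode objfpc}{$H+}
uses
  SysUtils;

// Return the text found between StartMark and EndMark,
// or '' if either marker is missing.
function ExtractBetween(const S, StartMark, EndMark: string): string;
var
  p1, p2: Integer;
begin
  Result := '';
  p1 := Pos(StartMark, S);
  if p1 = 0 then Exit;
  p1 := p1 + Length(StartMark);
  p2 := Pos(EndMark, Copy(S, p1, MaxInt));
  if p2 = 0 then Exit;
  Result := Copy(S, p1, p2 - 1);
end;

const
  // A made-up fragment standing in for the real page source.
  Sample = '{"price":"250000","city":"Chessy"}';
begin
  WriteLn('price = ', ExtractBetween(Sample, '"price":"', '"'));
  WriteLn('city = ', ExtractBetween(Sample, '"city":"', '"'));
end.
```

The same helper, called repeatedly while advancing through the page text, is enough for simple extraction; a real HTML/JSON parser is more robust when the site changes its markup.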
-
@franzala
As you have been told, what you want to achieve isn't simple. But if you're already good at string manipulation and data storage in Pascal, you may be interested in trying a very simple demo posted by @wp:
https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199 (https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199)
I followed the demo, and later created my web forum spam detection program. But because of the limits of my knowledge, it is only half-finished.
First, using the demo I wrote a very simple web browser. It ran extremely slowly, so I added my own image caching routine. Much better, but unfortunately I can't render the CSS. By analyzing the website I wanted to 'scrape', I was able to make the program follow the links of the pages automatically. But because the website has an enormous number of pages, I had to define the range for scraping on each run. And I had to slow it down by adding a delay for each page, because some websites or their web hosting's firewall will block incoming users (bots) that try to request too many pages within a certain time.
The fetched data is stored in a dbf file; I chose dbf because it needs zero installation and it is capable of what I want to do. On each run, the program downloads only a certain range of pages, updating pages that were already stored. Then I wrote some code to find all the outbound links in the stored data, compare them with my defined blacklist and whitelist keywords, and show a report of pages that contain suspicious spam links.
It sounds simple, just 2 paragraphs to explain it. But it actually took me months to write, and it became too hard for me, so I'm taking a break.
:-X I've been writing a program for spam detection on this forum.
I do not have a license to the data. But it's just for personal use, so I think it is okay.
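The dbf storage mentioned above can be sketched with FPC's TDbf component (from the fcl-db package). A minimal sketch; the table and field names are made up for the example:

```pascal
program DbfDemo;
{$mode objfpc}{$H+}
uses
  SysUtils, db, dbf;

var
  Table: TDbf;
begin
  Table := TDbf.Create(nil);
  try
    Table.FilePathFull := GetTempDir;   // write the table to the temp dir
    Table.TableName := 'ads.dbf';
    Table.FieldDefs.Add('PRICE', ftInteger);
    Table.FieldDefs.Add('CITY', ftString, 40);
    Table.CreateTable;
    Table.Open;
    Table.Append;                       // add one sample record
    Table.FieldByName('PRICE').AsInteger := 250000;
    Table.FieldByName('CITY').AsString := 'Chessy';
    Table.Post;
    WriteLn('records: ', Table.RecordCount);
    Table.Close;
  finally
    Table.Free;
  end;
end.
```

Because the dbf file is just a single file on disk, nothing has to be installed on the machine, which is the "zero installation" advantage mentioned above.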
-
Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).
The pages are probably dynamic, because there is client-side scripting to define the search (and probably also server-side).
If I understand your post properly, I will have to write JavaScript, which means learning it first.
I'd prefer to do it in Pascal with Lazarus!
Using Pascal will not be trivial, although it is possible.
Running Chrome as a headless browser, you can use --dump-dom to get the DOM of a web page:
chrome --headless --disable-gpu --dump-dom https://www.your-own-app.com/search?bla=blabla&....
It will output the DOM to STDOUT; you can use TProcess to run it and capture the output, or pipe the above command into your own application:
chrome --headless --disable-gpu --dump-dom https://www.your-own-app.com/search?bla=blabla&.... | yourownapp
For the latter, you can read Chrome's output from STDIN and then process the DOM using regex to extract the parts you are interested in. This may or may not be what you are looking for.
For advanced use cases (e.g. sending commands to simulate mouse clicks programmatically), you may want to take a look at the Chrome DevTools Protocol. It allows an external application to control a headless Chrome browser.
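The TProcess route can be sketched with FPC's RunCommand wrapper (unit Process). In this sketch /bin/echo stands in for the chrome binary so it runs anywhere; for real use you would substitute the chrome executable and the --headless --disable-gpu --dump-dom flags shown above, plus your search URL:

```pascal
program DumpDom;
{$mode objfpc}{$H+}
uses
  SysUtils, Process;

var
  Output: string;
begin
  // For real use, something like (path and URL are placeholders):
  //   RunCommand('chrome', ['--headless', '--disable-gpu',
  //     '--dump-dom', 'https://example.com/search'], Output);
  if RunCommand('/bin/echo', ['<html>stand-in DOM</html>'], Output) then
    WriteLn('captured: ', Trim(Output))
  else
    WriteLn('command failed');
end.
```

RunCommand returns False if the process could not be started or exited with an error, so the failure branch is worth keeping when the browser binary may be missing.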
-
Many thanks for your hints; I will try to use Chrome as a headless browser and look at the Chrome DevTools Protocol to see whether I'm able to understand it or not.