
Author Topic: How to automate copying the source code of several consecutive pages on the net  (Read 813 times)

franzala

  • New member
  • *
  • Posts: 6
I frequently browse a classified ads site that returns several pages of results for my searches.

I have written a small application to extract the ad data from the source code of one page, but I have to copy each page manually to obtain all the information, which is boring and time-consuming.

I now want to automate copying all the pages that answer my search in one go, to avoid this burden.

Could somebody give me a hint on how to do that?

Thanks in advance

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 651
    • Lebeau Software
That is a pretty broad topic, given that you didn't provide any details about how you access the site, or what tasks you perform manually that you want to automate.
Remy Lebeau
Lebeau Software - Owner, Developer
Internet Direct (Indy) - Admin, Developer (Support forum)

trev

  • Full Member
  • ***
  • Posts: 185
Quote
I have written a small application to extract the ad data from the source code of one page, but I have to copy each page manually to obtain all the information, which is boring and time-consuming.

If you've already written the application, I don't quite see how it can be "boring and time consuming" for the application to extract the data from more than one page.

Since you did not provide the information, I can only speculate that the one extra thing your application has to do is follow the link from one page of results to the next, which sounds pretty trivial. Perhaps you need to elaborate on what the issue is?
o Lazarus v2.1.0 r61775, FPC v3.3.1 r42640, macOS 10.14.6 (with sup update), Xcode 10.3
o Lazarus v2.1.0 r61574, FPC v3.3.1 r42318, FreeBSD 12.0 (Parallels VM)
o Lazarus v2.1.0 r61574, FPC v3.0.4, Ubuntu 18.04 (Parallels VM)

franzala

  • New member
  • *
  • Posts: 6
Hi trev,

Thanks for your answer. Where can I find the link from one page to the next in the source code (or which word or expression should I search for)? The source code of one page of the site usually amounts to about 150 pages of text.

Sorry to ask for such trivial information, but I'm not familiar with this stuff.

Remy Lebeau

  • Hero Member
  • *****
  • Posts: 651
    • Lebeau Software
Quote
Where can I find the link from one page to the next in the source code (or which word or expression should I search for)?

We can't answer that without seeing the actual page.  Do you see such a link when viewing the page in a web browser?  Are the pages numbered in their URLs?

franzala

  • New member
  • *
  • Posts: 6
At the bottom of the page there is a list of the pages and I have copied the link to page 3 to give an example:

https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max&page=3

It is easy to build the other links from this one by changing only the page number. My question is now: how do I obtain one or several files containing the source code of these pages (1 to 8 in this case)?
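
Building the links by changing only the page number could look roughly like the sketch below. It is only an illustration: the BaseUrl constant is a shortened, hypothetical version of the search URL (paste the full one from the browser, without the trailing page part), the PageUrl function name is made up for this example, and it assumes that page 1 is simply the URL without the page parameter.

```pascal
program BuildPageUrls;

{$mode objfpc}{$H+}

uses
  SysUtils;

const
  // Shortened, hypothetical base URL: replace it with the full search URL,
  // leaving out the trailing "&page=N" part.
  BaseUrl = 'https://www.leboncoin.fr/recherche/?category=9&sort=price&order=asc';

// Returns the URL of result page PageNo.
// Assumption: page 1 is the base URL itself, later pages add "&page=N".
function PageUrl(const Base: string; PageNo: Integer): string;
begin
  if PageNo <= 1 then
    Result := Base
  else
    Result := Base + '&page=' + IntToStr(PageNo);
end;

var
  I: Integer;
begin
  // Print the URLs of pages 1 to 8.
  for I := 1 to 8 do
    WriteLn(PageUrl(BaseUrl, I));
end.
```

Each generated URL can then be handed to whatever download routine is used to save the page source to a file.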

valdir.marcos

  • Hero Member
  • *****
  • Posts: 792
Can you attach a small sample project of what you've already done so we can better help you?

madref

  • Hero Member
  • *****
  • Posts: 679
  • ..... A day not Laughed is a day not Lived !!
    • Nursing With Humour
You treat a disease, you win, you lose.
You treat a person and I guarantee you, you win, no matter the outcome.

Lazarus 2.0.2 / FPC 3.0.4
Lazarus Trunc / FPC 3.0.4
Mac OS X Mojave

franzala

  • New member
  • *
  • Posts: 6
A short description of what I'm doing up to now:

1. I run a search on the ads site, which gives an answer spread over, for instance, 8 pages.

2. I manually copy the source code of each of the 8 pages into 8 files; each source code contains several parts, and one of them describes each ad included in the page with its characteristics.

3. The application I have written extracts from these 8 files the data that interest me, i.e. some of the characteristics such as the price, the location of the seller, and so on, and saves them in a file which I use to compare the offers (from the same search and also from previous searches).

What I find boring and time-consuming is the manual copy of each page, and I'm looking for a way to copy the pages automatically, starting from the first page, which contains the links to the 7 others. Here is a link to such a page:

https://www.leboncoin.fr/recherche/?category=9&locations=Saint-Martin-en-Haut_69850,Saint-Romain-en-Jarez_42800,69210,69280,Chessy_69380,69290,69670,69440,69510&zlat=45.66082&zlng=4.5597&zdefradius=5432&sort=price&order=asc&immo_sell_type=old&real_estate_type=1&price=150000-600000&rooms=3-max&square=110-max


I hope that this short explanation helps to clarify my question.

zamronypj

  • New Member
  • *
  • Posts: 28
    • Fano Framework, Free Pascal web application framework
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.

If the web page content is static and does not use JavaScript, you can use TFPHttpClient.
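
For the static case, a minimal sketch with TFPHTTPClient might look like this. It assumes FPC 3.2 or later (where the opensslsockets unit provides HTTPS support) and installed OpenSSL libraries. The DownloadPage helper, the example URL, the output file name and the User-Agent string are all placeholders invented for the illustration, and a site may still refuse requests that do not come from a browser.

```pascal
program FetchOnePage;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fphttpclient, opensslsockets;

// Downloads Url and writes the response body to FileName.
// Returns False (and prints the error) if the request fails.
function DownloadPage(const Url, FileName: string): Boolean;
var
  Client: TFPHTTPClient;
  Html: string;
  Output: TFileStream;
begin
  Result := False;
  Client := TFPHTTPClient.Create(nil);
  try
    Client.AllowRedirect := True;
    // Placeholder User-Agent; some servers reject requests without one.
    Client.AddHeader('User-Agent', 'Mozilla/5.0 (compatible; AdsReader/0.1)');
    try
      Html := Client.Get(Url);  // raises an exception on a non-200 status
      Output := TFileStream.Create(FileName, fmCreate);
      try
        if Length(Html) > 0 then
          Output.WriteBuffer(Html[1], Length(Html));
      finally
        Output.Free;
      end;
      Result := True;
    except
      on E: Exception do
        WriteLn('Download failed: ', E.Message);
    end;
  finally
    Client.Free;
  end;
end;

begin
  // Placeholder URL and file name.
  if DownloadPage('https://www.example.com/', 'page1.html') then
    WriteLn('Saved page1.html');
end.
```

The saved file can then be fed to an existing parser exactly like a page copied by hand from the browser.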
Fano Framework, Free Pascal web application framework https://fanoframework.github.io
Personal Projects https://v3.juhara.com
Github https://github.com/zamronypj

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5567
    • wiki
Question: Do you have code that automatically downloads a page? For example, you tell your app that the first page is at http://...... and YOUR APP downloads it?

Because if I read this correctly, everyone here assumes this.
But you said you copy the page into a file (as in, you do "Save as" in your browser), and then your app only opens the file on your hard disk?

In the latter case, you may simply need something like lnet.

franzala

  • New member
  • *
  • Posts: 6
Quote
I do not quite understand what you are trying to accomplish. But if you are doing web scraping on a web page that is rendered dynamically with JavaScript, you may want to use a headless browser (e.g. Chrome, PhantomJS) and automate the task with a script.

If the web page content is static and does not use JavaScript, you can use TFPHttpClient.

Yes, what I want to accomplish is web scraping (I didn't know the name; my English and my knowledge of the web are both poor).

The pages are probably dynamic, because there is client-side scripting to define the search (and probably also server-side scripting).

If I understand your post properly, I will have to write JavaScript, which means first learning how to do that.

I'd prefer to do it in Pascal with Lazarus!

Thaddy

  • Hero Member
  • *****
  • Posts: 8673
Quote
I'd prefer to do it in Pascal with Lazarus!
It is not going to work in any language, except when you are a really good programmer.
Most people that want to use threading should learn to patch their jeans first: use a needle.

Martin_fr

  • Administrator
  • Hero Member
  • *
  • Posts: 5567
    • wiki
Quick look at the page.

It seems the server may actively try to block downloads from outside a browser. At least "wget" failed for me. (But it could have been me, not escaping the URL correctly...)

You also need to check whether you are legally allowed to process their data automatically.
- For non-personal use (if you re-publish the results), this is probably illegal, since you would need a license to the data.
- For personal use (only you look at the results), this may depend on the country you live in...

In any case, you will need a very good understanding of web technologies.
The data appears to be in the page (if you can get the page). But you need to parse the format (and the company may change it, and then you start over). That can be done in Pascal.
In any case (and completely independent of what language you use), it is NOT trivial.


Handoko

  • Hero Member
  • *****
  • Posts: 3124
  • My goal: build my own game engine using Lazarus
@franzala

As you have been told, what you want to achieve isn't simple. But if you're already good at string manipulation and data storage in Pascal, you may be interested in trying a very simple demo posted by @wp:

https://forum.lazarus.freepascal.org/index.php/topic,35980.msg239199.html#msg239199

I followed the demo, and later created my web forum spam detection program. But because of the limits of my knowledge, it is only half-finished.

First, using the demo, I wrote a very simple web browser. It ran extremely slowly, so I added my own image caching routine. That was much better, but unfortunately I can't render CSS. By analyzing the website I wanted to scrape, I was able to make the program follow the links between pages automatically. But because the website has an enormous number of pages, I had to define the range to scrape on each run. I also had to slow it down by adding a delay between pages, because some websites, or their web host's firewall, will block incoming users (bots) that request too many pages within a certain period of time.
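
The idea of a fixed page range per run with a pause between requests can be sketched as follows. This is not Handoko's program, only an illustration: BaseUrl, FirstPage, LastPage, DelayMs and the PageFileName helper are hypothetical, the 3 second delay is an arbitrary guess, and FPC 3.2 or later with OpenSSL is assumed for HTTPS.

```pascal
program FetchPageRange;

{$mode objfpc}{$H+}

uses
  SysUtils, fphttpclient, opensslsockets;

const
  // Hypothetical values, chosen only for illustration.
  BaseUrl   = 'https://www.example.com/list?page=';
  FirstPage = 1;
  LastPage  = 8;
  DelayMs   = 3000;  // pause between two requests, in milliseconds

// File name under which the source of page PageNo is stored.
function PageFileName(PageNo: Integer): string;
begin
  Result := Format('page%d.html', [PageNo]);
end;

var
  PageNo: Integer;
begin
  for PageNo := FirstPage to LastPage do
  begin
    try
      // Class method: downloads the URL straight into a local file.
      TFPHTTPClient.SimpleGet(BaseUrl + IntToStr(PageNo), PageFileName(PageNo));
      WriteLn('Saved ', PageFileName(PageNo));
    except
      on E: Exception do
        WriteLn('Page ', PageNo, ' failed: ', E.Message);
    end;
    // Be polite to the server: wait before requesting the next page.
    if PageNo < LastPage then
      Sleep(DelayMs);
  end;
end.
```

A failed page is reported and skipped, so one blocked request does not abort the whole run.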

The fetched data is stored in a dbf file. I chose dbf because it needs zero installation and is capable of what I want to do. On each run, the program downloads only a certain range of pages and updates any pages that were already stored. I then wrote some code to find all the outbound links in the stored data and compare them with my blacklist and whitelist keywords, and to show a report of pages that contain suspicious spam links.
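
The link check described above can be approximated with plain string functions. The sketch below is an assumption about how such a check might work, not the actual program: ExtractLinks and ContainsKeyword are made-up names, the scan only understands double-quoted href attributes, and the rule "blacklisted and not whitelisted means suspicious" is a guess at how the two lists are combined.

```pascal
program FindSuspiciousLinks;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, StrUtils;

// Collects the value of every href="..." attribute found in Html.
// Naive scan: real-world pages may need a proper HTML parser.
procedure ExtractLinks(const Html: string; Links: TStrings);
const
  Marker = 'href="';
var
  StartPos, EndPos: SizeInt;
begin
  StartPos := PosEx(Marker, Html, 1);
  while StartPos > 0 do
  begin
    StartPos := StartPos + Length(Marker);
    EndPos := PosEx('"', Html, StartPos);
    if EndPos = 0 then
      Break;
    Links.Add(Copy(Html, StartPos, EndPos - StartPos));
    StartPos := PosEx(Marker, Html, EndPos + 1);
  end;
end;

// True if Url contains any of the keywords (case-insensitive).
function ContainsKeyword(const Url: string; Keywords: TStrings): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := 0 to Keywords.Count - 1 do
    if Pos(LowerCase(Keywords[I]), LowerCase(Url)) > 0 then
      Exit(True);
end;

var
  Links, Blacklist, Whitelist: TStringList;
  I: Integer;
begin
  Links := TStringList.Create;
  Blacklist := TStringList.Create;
  Whitelist := TStringList.Create;
  try
    Blacklist.Add('casino');         // hypothetical keyword
    Whitelist.Add('freepascal.org'); // hypothetical keyword
    ExtractLinks('<a href="http://example.test/casino">x</a> ' +
                 '<a href="https://forum.lazarus.freepascal.org/">y</a>', Links);
    for I := 0 to Links.Count - 1 do
      if ContainsKeyword(Links[I], Blacklist) and
         not ContainsKeyword(Links[I], Whitelist) then
        WriteLn('Suspicious: ', Links[I]);
  finally
    Whitelist.Free;
    Blacklist.Free;
    Links.Free;
  end;
end.
```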

It sounds simple, with just 2 paragraphs to explain it. But it actually took me months to write, and it became too hard for me, so I took a break.

 :-X I've been writing a program for spam detection on this forum.
I do not have a license for the data. But it's just for personal use, so I think it is okay.
« Last Edit: September 12, 2019, 05:52:39 pm by Handoko »