
Scraping Wikipedia pages


maurobio:
Dear ALL,

Does anyone know of example FreePascal/Delphi code to scrape a Wikipedia page? I am especially interested in: (1) getting a "snippet" from a given page (that is, the first few lines of the summary, as plain text without any tags); and (2) getting a list of up to five images (if any) available in the gallery section.

For example, given the page: https://en.wikipedia.org/wiki/Vicia_faba

I would like to get the text snippet: "Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean, is a species of flowering plant in the pea and bean family Fabaceae. It is widely cultivated as a crop for human consumption, and also as a cover crop. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name."

I would also like to get a list of the images available in the gallery (up to five):

https://en.wikipedia.org/wiki/File:Vicia_faba.jpg
https://en.wikipedia.org/wiki/File:Crimson_BB1.jpg
https://en.wikipedia.org/wiki/File:Vicia_faba,_broad_bean_seed_showing_outer_seed_coating.jpg
https://en.wikipedia.org/wiki/File:Tuinboon_voor_zaad.jpg
https://en.wikipedia.org/wiki/File:Aphis_fabae,_zwarte_bonenluis.jpg

In Python there is a great library for parsing HTML, called 'BeautifulSoup', which makes such things easy. I wonder if something similar exists for FPC?

Thanks in advance for any assistance you can provide.

With best wishes,

MarkMLl:
I'd suggest you'd be better off going to the underlying data, which I think is available as a downloadable database. The summary paragraph is probably marked as such.

MarkMLl

engkin:
I would suggest using Internet Tools.
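Something along these lines, for example (a rough, untested sketch; it assumes the simpleinternet unit from the Internet Tools package, and the XPath expressions may need adjusting to the current page markup):

--- Code: Pascal ---
uses
  simpleinternet, xquery;

var
  v: IXQValue;
begin
  // First non-empty paragraph of the article body, as plain text
  WriteLn(process('https://en.wikipedia.org/wiki/Vicia_faba',
                  '//div[@class="mw-parser-output"]/p[2]').toString);

  // href of every image link on the page
  for v in process('https://en.wikipedia.org/wiki/Vicia_faba',
                   '//a[@class="image"]/@href') do
    WriteLn(v.toString);
end.
--- End code ---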

Kays:

--- Quote from: maurobio on September 11, 2021, 01:23:59 pm ---[…] scrape a Wikipedia page? […]
--- End quote ---
Don’t do that. If you want to perform heavy-duty analysis on the Wikipedia corpus, please use a database dump: https://en.wikipedia.org/wiki/WP:DD


--- Quote from: maurobio on September 11, 2021, 01:23:59 pm ---[…] For example, given the page: […] I would like to get the text snippet: […]
--- End quote ---
For that you could use the REST API, e.g. a request to https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba will return a JSON object whose extract field contains the plain-text summary.
Note that there are other APIs; obtaining the images in a <gallery> section, specifically, will require processing the entire page contents.
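A minimal sketch of such a request using fphttpclient and fpjson from the FCL (untested; it assumes FPC 3.2+ for the opensslsockets unit, and error handling is omitted):

--- Code: Pascal ---
uses
  fphttpclient, opensslsockets, fpjson, jsonparser;

var
  Data: TJSONData;
begin
  // Fetch the page summary from the REST API
  Data := GetJSON(TFPHTTPClient.SimpleGet(
    'https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba'));
  try
    // The plain-text snippet is in the "extract" field
    WriteLn(Data.FindPath('extract').AsString);
  finally
    Data.Free;
  end;
end.
--- End code ---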

Jurassic Pork:
Hello,
you can also try to use Selenium with my WebDriver for Lazarus.
Here is code to get the summary and the URLs of the images in the page https://en.wikipedia.org/wiki/Vicia_faba:


--- Code: Pascal ---
try
  {$IFDEF WINDOWS}
  //Robot := TChromeDriver.Create(nil);
  //Robot := TFireFoxDriver.Create(nil);
  Robot := TEdgeDriver.Create(nil);
  //Robot := TIEDriver.Create(nil);
  //Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\chromedriver.exe');
  //Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\geckodriver.exe');
  Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\msedgedriver.exe');
  //Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\IEDriverServer.exe');
  {$ELSE}
  //Robot := TChromeDriver.Create(nil);
  Robot := TFireFoxDriver.Create(nil);
  //Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/chromedriver');
  Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/geckodriver');
  //Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/msedgedriver');
  {$ENDIF}
  Sleep(2000);
  Robot.NewSession;
  Robot.Implicitly_Wait(2000);
  Robot.Set_Window_Size(640, 640);
  Robot.GetURL('https://en.wikipedia.org/wiki/Vicia_faba');
  // Second paragraph of the article body holds the summary
  summary := Robot.FindElementByXPath('//div[@class="mw-parser-output"]/p[2]');
  Memo1.Append(summary.Text);
  // Every image link on the page
  images := Robot.FindElementsByXPath('//a[@class="image"]');
  for i := 0 to images.Count - 1 do
    Memo1.Append(images.Items[i].AttributeValue('href'));
  Memo1.Append('=================================');
  Sleep(30000);
  Robot.Quit;
finally
  Robot.Free;
end;
--- End code ---

The result on Windows 10 (Lazarus 2.0.12, webdriver4L 0.2, using MS Edge Chromium as the browser) is in the attachment.

The REST API solution given by Kays is simpler. To get the summary and the media list of the page in JSON format:
https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba
https://en.wikipedia.org/api/rest_v1/page/media-list/Vicia_faba
The job is then to extract the information from the JSON responses.
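For the media list, the response is a JSON object with an "items" array. A sketch of taking up to five entries from it (untested, same fphttpclient/fpjson approach as above; it assumes each item carries the file page name in a "title" field, as the media-list endpoint currently returns):

--- Code: Pascal ---
uses
  fphttpclient, opensslsockets, fpjson, jsonparser;

var
  Data: TJSONData;
  Items: TJSONArray;
  i: Integer;
begin
  Data := GetJSON(TFPHTTPClient.SimpleGet(
    'https://en.wikipedia.org/api/rest_v1/page/media-list/Vicia_faba'));
  try
    Items := Data.FindPath('items') as TJSONArray;
    // Keep at most the first five media entries
    for i := 0 to Items.Count - 1 do
    begin
      if i = 5 then Break;
      WriteLn('https://en.wikipedia.org/wiki/' +
              Items.Objects[i].Get('title', ''));
    end;
  finally
    Data.Free;
  end;
end.
--- End code ---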

Friendly, J.P
