Author Topic: Scraping Wikipedia pages  (Read 2887 times)

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Scraping Wikipedia pages
« on: September 11, 2021, 01:23:59 pm »
Dear ALL,

Does anyone know of example FreePascal/Delphi code to scrape a Wikipedia page? I am especially interested in: (1) getting a "snippet" from a given page (that is, the first few lines of the summary, as plain text without any tags); and (2) getting a list of up to five images (if any) available in the gallery section.

For example, given the page: https://en.wikipedia.org/wiki/Vicia_faba

I would like to get the text snippet: "Vicia faba, also known in the culinary sense as the broad bean, fava bean, or faba bean, is a species of flowering plant in the pea and bean family Fabaceae. It is widely cultivated as a crop for human consumption, and also as a cover crop. Varieties with smaller, harder seeds that are fed to horses or other animals are called field bean, tic bean or tick bean. Horse bean, Vicia faba var. equina Pers., is a variety recognized as an accepted name."

I would also like to get a list of the images available in the gallery (up to five):

https://en.wikipedia.org/wiki/File:Vicia_faba.jpg
https://en.wikipedia.org/wiki/File:Crimson_BB1.jpg
https://en.wikipedia.org/wiki/File:Vicia_faba,_broad_bean_seed_showing_outer_seed_coating.jpg
https://en.wikipedia.org/wiki/File:Tuinboon_voor_zaad.jpg
https://en.wikipedia.org/wiki/File:Aphis_fabae,_zwarte_bonenluis.jpg

In Python there is a great library for parsing HTML, called 'BeautifulSoup', which makes such things easy. I wonder whether something similar exists for FPC?

Thanks in advance for any assistance you can provide.

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

MarkMLl

  • Hero Member
  • *****
  • Posts: 6676
Re: Scraping Wikipedia pages
« Reply #1 on: September 11, 2021, 01:45:03 pm »
I'd suggest that you'd be better off going to the underlying data, which I think is already available as a downloadable database. The summary paragraph is probably marked as such.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Scraping Wikipedia pages
« Reply #2 on: September 11, 2021, 02:39:00 pm »
I would suggest using Internet Tools

Kays

  • Hero Member
  • *****
  • Posts: 569
  • Whasup!?
    • KaiBurghardt.de
Re: Scraping Wikipedia pages
« Reply #3 on: September 11, 2021, 03:32:02 pm »
[…] scrape a Wikipedia page? […]
Don’t do that. If you want to perform heavy-duty analysis on the Wikipedia corpus, please use a database dump https://en.wikipedia.org/wiki/WP:DD

[…] For example, given the page: […] I would like to get the text snippet: […]
For that you could use the REST API; e.g., a request to https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba will return a JSON document containing an extract field.
Note that there are other APIs; specifically, obtaining any images in a <gallery> will require processing the entire page contents.
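
As a minimal, untested sketch, that endpoint can be queried from FPC roughly like this (assuming fphttpclient and, for HTTPS, the opensslsockets unit; the JSON still has to be parsed with a JSON library afterwards):

Code: Pascal

program summary_sketch;
{$mode objfpc}{$H+}
uses
  fphttpclient, opensslsockets;
var
  Response: string;
begin
  // Fetch the REST summary for the article; the "extract" field of the
  // returned JSON holds the plain-text snippet.
  Response := TFPHTTPClient.SimpleGet(
    'https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba');
  WriteLn(Response);
end.
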
Yours Sincerely
Kai Burghardt

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: Scraping Wikipedia pages
« Reply #4 on: September 11, 2021, 05:28:45 pm »
hello,
You can also try to use Selenium with my WebDriver for Lazarus. Here is code to get the summary and the URLs of the images in the web page https://en.wikipedia.org/wiki/Vicia_faba:

Code: Pascal

try
  {$IFDEF WINDOWS}
  // Robot := TChromeDriver.Create(nil);
  // Robot := TFireFoxDriver.Create(nil);
  Robot := TEdgeDriver.Create(nil);
  // Robot := TIEDriver.Create(nil);

  // Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\chromedriver.exe');
  // Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\geckodriver.exe');
  Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\msedgedriver.exe');
  // Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '\IEDriverServer.exe');
  {$ELSE}
  // Robot := TChromeDriver.Create(nil);
  Robot := TFireFoxDriver.Create(nil);
  // Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/chromedriver');
  Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/geckodriver');
  // Robot.StartDriver(ExtractFileDir(ParamStr(0)) + '/msedgedriver');
  {$ENDIF}
  Sleep(2000);
  Robot.NewSession;
  Robot.Implicitly_Wait(2000);
  Robot.Set_Window_Size(640, 640);
  Robot.GetURL('https://en.wikipedia.org/wiki/Vicia_faba');
  summary := Robot.FindElementByXPath('//div[@class="mw-parser-output"]/p[2]');
  Memo1.Append(summary.Text);
  images := Robot.FindElementsByXPath('//a[@class="image"]');
  for i := 0 to images.Count - 1 do
    Memo1.Append(images.Items[i].AttributeValue('href'));
  Memo1.Append('=================================');
  Sleep(30000);
  Robot.Quit;
finally
  Robot.Free;
end;


The result on Windows 10, Lazarus 2.0.12, webdriver4L 0.2, using MS Edge (Chromium) as the browser, is in the attachment.

The REST API solution given by Kays is simpler. To get the summary and the media list of the page in JSON format:
https://en.wikipedia.org/api/rest_v1/page/summary/Vicia_faba
https://en.wikipedia.org/api/rest_v1/page/media-list/Vicia_faba
The job is then to extract the information from the JSON responses.

Friendly, J.P
« Last Edit: September 11, 2021, 05:54:45 pm by Jurassic Pork »
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

BobDog

  • Sr. Member
  • ****
  • Posts: 394
Re: Scraping Wikipedia pages
« Reply #5 on: September 11, 2021, 08:36:13 pm »

Windows.
This gives a .txt and an .html file; take your choice.
I have set the wiki page as the default.
Code: Pascal

program WebPageToText;

function system(s: pchar): integer; cdecl; external 'msvcrt.dll' name 'system';

var
  g: ansistring;
  defaultstring: ansistring = 'https://en.wikipedia.org/wiki/Vicia_faba';  // set as default
  kill: integer = 1;

procedure savefile(fname: string; text: ansistring; killflag: integer = 0);
label
  kill;
var
  T: TextFile;
begin
  AssignFile(T, fname);  // assign first so the kill path can erase the file
  if killflag <> 0 then goto kill;
  {$I-}
  try
    Rewrite(T);
    Writeln(T, text);
  finally
    CloseFile(T);
    {$I+}
  end;
kill:
  if killflag <> 0 then Erase(T);
end;

procedure runscript(filename: ansistring);
begin
  system(pchar('cscript.exe /Nologo ' + filename));
end;

begin
  g := g + 'Const TriStateTrue = -1 ' + chr(10);
  g := g + 'URL = InputBox("Enter (or paste) the URL to extract the Code "&vbcr&vbcr&_' + chr(10);
  g := g + '"Exemple ""https://www.freebasic.net""","Extraction of Source text and html  ","' + defaultstring + '")' + chr(10);
  g := g + 'If URL = "" Then WScript.Quit' + chr(10);
  g := g + 'Titre = "Extraction du Code Source de " & URL' + chr(10);
  g := g + 'Set ie = CreateObject("InternetExplorer.Application")' + chr(10);
  g := g + 'Set objFSO = CreateObject("Scripting.FileSystemObject")' + chr(10);
  g := g + 'ie.Navigate(URL)' + chr(10);
  g := g + 'ie.Visible=false' + chr(10);
  g := g + 'DO WHILE ie.busy' + chr(10);
  g := g + 'LOOP' + chr(10);
  g := g + 'DataHTML = ie.document.documentElement.innerHTML' + chr(10);
  g := g + 'DataTxt = ie.document.documentElement.innerText' + chr(10);
  g := g + 'strFileHTML = "CodeSourceHTML.txt"' + chr(10);
  g := g + 'strFileTxt = "CodeSourceTxt.txt"' + chr(10);
  g := g + 'Set objHTMLFile = objFSO.OpenTextFile(strFileHTML,2,True, TriStateTrue)' + chr(10);
  g := g + 'objHTMLFile.WriteLine(DataHTML)' + chr(10);
  g := g + 'objHTMLFile.Close' + chr(10);
  g := g + 'Set objTxtFile = objFSO.OpenTextFile(strFileTxt,2,True, TriStateTrue)' + chr(10);
  g := g + 'objTxtFile.WriteLine(DataTxt)' + chr(10);
  g := g + 'objTxtFile.Close' + chr(10);
  g := g + 'ie.Quit' + chr(10);
  g := g + 'Set ie=Nothing' + chr(10);
  g := g + ' Ouvrir(strFileHTML)' + chr(10);
  g := g + ' Ouvrir(strFileTxt)' + chr(10);
  g := g + 'wscript.Quit' + chr(10);
  g := g + 'Function Ouvrir(File)' + chr(10);
  g := g + '    Set ws=CreateObject("wscript.shell")' + chr(10);
  g := g + '    ws.run "Notepad.exe "& File,1,False' + chr(10);
  g := g + 'end Function' + chr(10);

  savefile('script.vbs', g);
  runscript('script.vbs');

  writeln('Press enter to end . . .');
  readln;
  savefile('script.vbs', '', kill);
end.

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: Scraping Wikipedia pages
« Reply #6 on: September 11, 2021, 11:36:11 pm »
Dear Wizards,

Thank you all very much for your insightful answers and helpful suggestions.

Using a database dump is not an option for me, because my application is intended to be a specialized search engine that provides the user with information on biological species, automagically compiled from several sources on the web.

The possibility of using the Wikipedia API as suggested by @Kays (and @Jurassic Pork), which offers the summary extract as required, looks great; for getting the image links, I could use the internettools library as suggested by @engkin. As @JP pointed out, it is also possible to get the image links using the API.

On the other hand, I did not know the "webdriver for lazarus" library used by @Jurassic Pork. I do agree with him that using the REST API is much simpler, but I would like to know more about WebDriver, not only out of curiosity but because it looks great and should be useful for other applications! (I could not find it in the link provided in JP's answer, which redirects to the main page of the forum.)

Unfortunately, the suggestion by @BobDog is for Windoze only and it does not suit me because my application should be a CGI application running on a LAMP server (and also because Windoze sucks).

Again, thank you very much!

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: Scraping Wikipedia pages
« Reply #7 on: September 12, 2021, 12:17:45 am »
(I could not find it in the link provided in JP's answer, which redirects to the main page of the forum.)

Are you logged in when you click on the link? You can also search the forum for "browser automation" in the last 99 days and go to the last post.
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: Scraping Wikipedia pages
« Reply #8 on: September 12, 2021, 12:34:35 am »
@JP.

Sure, I was not logged in when I first tried to access the link. I got it now.

Thank you!

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: Scraping Wikipedia pages
« Reply #9 on: September 12, 2021, 04:10:29 pm »
Dear ALL,

Here is my sample code for getting the text summary and a list of (up to five) image files for a Wikipedia page. Notice that the sample code uses the really great JsonTools library, available from https://www.getlazarus.org/json/, which makes working with JSON data much easier than using the built-in FPC libraries.

Code: Pascal

program wiki_example;

{$APPTYPE CONSOLE}
{$mode objfpc}{$H+}

uses
  SysUtils, fphttpclient, JsonTools;

const
  url1 = 'https://en.wikipedia.org/api/rest_v1/page/summary/';
  url2 = 'https://en.wikipedia.org/api/rest_v1/page/media-list/';
  searchName = 'Vicia faba';

var
  rawJson: AnsiString;
  JsonData: TJsonNode;
  N: integer;

begin
  WriteLn('Searching Wikipedia for: ' + searchName);
  // Get the page summary
  JsonData := TJsonNode.Create;
  rawJson := TFPHTTPClient.SimpleGet(url1 + StringReplace(searchName, ' ', '_', [rfReplaceAll]));
  JsonData.Parse(rawJson);
  WriteLn(JsonData.Find('extract').AsString);
  JsonData.Free;
  WriteLn;

  // Get the images in the gallery
  JsonData := TJsonNode.Create;
  rawJson := TFPHTTPClient.SimpleGet(url2 + StringReplace(searchName, ' ', '_', [rfReplaceAll]));
  JsonData.Parse(rawJson);
  for N := 1 to 5 do
    WriteLn(JsonData.Find('items/' + IntToStr(N - 1) + '/title').AsString);
  JsonData.Free;
end.

This code works, but I am not entirely happy with it: I could not devise a way of getting all the image file names, instead of just five, because I could not see how to get the total number of items in the JSON data (in this case, it should be 15). Also, if there are fewer than five images in a searched page, the above code will fail.

Could anyone give me a hint?

Thanks in advance!

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

BobDog

  • Sr. Member
  • ****
  • Posts: 394
Re: Scraping Wikipedia pages
« Reply #10 on: September 12, 2021, 05:18:11 pm »

Sorry, got all the stuff, compiled the json unit, copied a myriad of units, then I get:
Code: Text

Microsoft Windows [Version 10.0.19042.1165]
(c) Microsoft Corporation. All rights reserved.

C:\Users\Computer\Desktop\fb\pascal\mystuff\jsoninternet>internet
Searching Wikipedia for: Vicia faba
An unhandled exception occurred at $00431920:
ESSLSocketError: No SSL Socket support compiled in.
Please include opensslsockets unit in program and recompile it.
  $00431920
  $004190A8
  $00419151
  $0041B0F0
  $0041B449
  $0041BA00
  $0041BC67
  $0041BCC3
  $0041BD88
  $0040194C

C:\Users\Computer\Desktop\fb\pascal\mystuff\jsoninternet>
And I don't have the opensslsockets unit.
Maybe a Linux thing?



maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: Scraping Wikipedia pages
« Reply #11 on: September 12, 2021, 05:45:37 pm »
Dear @BobDog,

Thanks for your message. As a matter of fact, I have compiled and run this code successfully on Windows 10 without any trouble. You probably have to install OpenSSL on your system.
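
Alternatively, judging from the error message itself, it may be enough to add the opensslsockets unit to the uses clause and recompile; a sketch of the change (untested, and the OpenSSL libraries still have to be present on the system):

Code: Pascal

// Suggestion based on the error text: pulling in opensslsockets registers
// the SSL handler that fphttpclient needs for HTTPS requests.
uses
  SysUtils, fphttpclient, opensslsockets, JsonTools;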

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: Scraping Wikipedia pages
« Reply #12 on: September 12, 2021, 06:04:43 pm »
hello,
maurobio, fpjson + jsonparser from Free Pascal's fcl-json is also a great tool :P

Try this code:
Code: Pascal

program wiki_example;

{$APPTYPE CONSOLE}
{$mode objfpc}{$H+}

uses
  Classes, TypInfo, SysUtils, fphttpclient, fpjson, jsonparser, opensslsockets;

const
  url1 = 'https://en.wikipedia.org/api/rest_v1/page/summary/';
  url2 = 'https://en.wikipedia.org/api/rest_v1/page/media-list/';
  searchName = 'Vicia faba';

var
  jData: TJSONData;
  jItem, jItems: TJSONData;
  i: integer;

begin
  WriteLn('Searching Wikipedia for: ' + searchName);
  // Get the page summary
  jData := GetJSON(TFPHTTPClient.SimpleGet(url1 + StringReplace(searchName, ' ', '_', [rfReplaceAll])));
  WriteLn(jData.FindPath('extract').AsString);
  jData.Free;  // free the summary document before reusing jData
  WriteLn;
  // Get the images in the gallery
  jData := GetJSON(TFPHTTPClient.SimpleGet(url2 + StringReplace(searchName, ' ', '_', [rfReplaceAll])));
  jItems := jData.FindPath('items');
  for i := 0 to jItems.Count - 1 do
  begin
    jItem := jItems.Items[i];
    WriteLn(jItem.FindPath('title').AsString);
  end;
  jData.Free;
  ReadLn;
end.

OK on Windows 10, Lazarus 2.3.0, FPC 3.3.1.

Friendly, J.P
 
« Last Edit: September 12, 2021, 06:21:40 pm by Jurassic Pork »
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

maurobio

  • Hero Member
  • *****
  • Posts: 623
  • Ecology is everything.
    • GitHub
Re: Scraping Wikipedia pages
« Reply #13 on: September 12, 2021, 08:40:22 pm »
Dear @Jurassic Pork,

Thank you, as always, for your invaluable help and useful insights.

I do agree that the FPC/Lazarus built-in JSON libraries are also great, but I found them more difficult to use than JsonTools, and their documentation is not as extensive as I would like. On the other hand, the author of JsonTools presents some convincing comparisons between his library and others, showing that it is usually faster than other Pascal parsers (https://www.getlazarus.org/json/tests/).
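
For the record, your example also suggests an answer to my earlier question about iterating over all the images: assuming TJsonNode exposes a Count property and a Child(Index) method, as the JsonTools documentation suggests, the fixed loop in my code could probably be rewritten along these lines (untested fragment, with Items declared as another TJsonNode):

Code: Pascal

// Hypothetical, untested fragment: walk every entry of the media-list
// response instead of a fixed five. Items will be nil if the page has none.
Items := JsonData.Find('items');
if Assigned(Items) then
  for N := 0 to Items.Count - 1 do
    WriteLn(Items.Child(N).Find('title').AsString);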

Again, thanks a lot!

With best wishes,
UCSD Pascal / Burroughs 6700 / Master Control Program
Delphi 7.0 Personal Edition
Lazarus 2.0.12 - FPC 3.2.0 on GNU/Linux Mint 19.1, Lubuntu 18.04, Windows XP SP3, Windows 7 Professional, Windows 10 Home

 
