
Author Topic: Scraping data from a web page  (Read 31979 times)

Caravelle

  • Jr. Member
  • **
  • Posts: 52
Scraping data from a web page
« on: September 25, 2013, 12:03:03 am »
Laz 1.0.12; FPC 2.6.2; Vista 32.  New to Lazarus, have used Delphi 7.

I think I am probably trying to do much the same as requested in the topic "how to extract text only from a tiphtmlpanel" below, but despite reading a lot of wiki pages I do not really know where to start. 

I am writing a Windows equivalent of a working program I have written in Basic4Android (B4A) for my Samsung Galaxy tablet. For the tasks below, B4A has libraries and examples which make it easy, but I can't work out where to start with Lazarus; third-party components seem to be needed, but which ones?  They have to be simple enough for me to understand, with clear guidance and simple practical examples.

Task 1: log into a specific website - URL, username, password all hard-coded.  Check for success.

Task 2: go to the same site with a search parameter tacked onto the URL and copy the resulting HTML (not any images etc.) to a local file.

I do not need or want to display any browser window, it should be a silent invisible process.

Then I extract the data I need from the saved file, which I can hopefully manage for myself.

Given that I can do much the same thing, albeit more slowly, by using my browser and copying and pasting, I'm not keen to waste more time developing the routine than I could possibly save using it.  I am the only one who will be using the program.

So, is there a simple practical way to "scrape" a web page with Lazarus?

Many thanks

Caravelle

dcminus

  • New Member
  • *
  • Posts: 24
Re: Scraping data from a web page
« Reply #1 on: September 25, 2013, 01:32:11 am »
I would recommend Timewarp's example as a start:

http://forum.lazarus.freepascal.org/index.php/topic,19506.0.html

I did a similar program for work.

First you need to know which fields (user, pass) you need to fill.

A few functions click a button and fill the form fields needed to log in.

Once logged in, navigate to the search URL.

Again fill the forms; saving the result should be easy too.

Hope that helps...

I use the following functions:

Code: [Select]
uses MSHTML; // Windows-only: these helpers drive the IE DOM interfaces

function GetFormByNumber(document: IHTMLDocument2;
  formNumber: integer): IHTMLFormElement;
var
  Forms: IHTMLElementCollection;
begin
  Forms := document.Forms as IHTMLElementCollection;
  if formNumber < Forms.Length then
    Result := Forms.Item(formNumber, '') as IHTMLFormElement
  else
    Result := nil;
end;

function GetFieldValue(fromForm: IHTMLFormElement;
  const fieldName: string): string;
var
  field: IHTMLElement;
  inputField: IHTMLInputElement;
  selectField: IHTMLSelectElement;
  textField: IHTMLTextAreaElement;
begin

  field := fromForm.Item(fieldName, '') as IHTMLElement;
  if not Assigned(field) then
    Result := ''
  else if field.tagName = 'INPUT' then
  begin
    inputField := field as IHTMLInputElement;
    if (inputField.type_ <> 'radio') and (inputField.type_ <> 'checkbox') then
      Result := inputField.value
    else if inputField.checked then
      Result := 'checked'
    else
      Result := 'unchecked';
  end
  else if field.tagName = 'SELECT' then
  begin
    selectField := field as IHTMLSelectElement;
    Result := selectField.value
  end
  else if field.tagName = 'TEXTAREA' then
  begin
    textField := field as IHTMLTextAreaElement;
    Result := textField.value;
  end;
end;


procedure SetFieldValue(theForm: IHTMLFormElement; const fieldName: string;
  const newValue: string);
var
  field: IHTMLElement;
  inputField: IHTMLInputElement;
  selectField: IHTMLSelectElement;
  textField: IHTMLTextAreaElement;
begin
  field := theForm.Item(fieldName, '') as IHTMLElement;
  if Assigned(field) then
  begin
    if field.tagName = 'INPUT' then
    begin
      inputField := field as IHTMLInputElement;
      // Make the change below to catch checks and radios.
      if (inputField.type_ = 'checkbox') or (inputField.type_ = 'radio') then
      begin
        if newValue = 'Y' then
          inputField.Checked := True
        else
          inputField.Checked := False;
      end
      else
        inputField.Value := newValue;
    end
    else if field.tagName = 'SELECT' then
    begin
      selectField := field as IHTMLSelectElement;
      selectField.Value := newValue;
    end
    else if field.tagName = 'TEXTAREA' then
    begin
      textField := field as IHTMLTextAreaElement;
      textField.Value := newValue;
    end;
  end;
end;

procedure ClickElementByID(iDoc1: IHTMLDocument2; ClickThis: string);
var
  WebForm: IHTMLFormElement;
  FormElements: olevariant;
  I: integer;
begin
  if Assigned(iDoc1) then
  begin
    WebForm := iDoc1.Forms.Item(0, '') as IHTMLFormElement;
    FormElements := WebForm.Elements;
    // Search for the element with this id
    for I := 0 to FormElements.Length - 1 do
    begin
      if FormElements.Item(I).getAttribute('id')  = ClickThis then
        // click on that element
        FormElements.Item(I).Click;
    end;
  end;
end;


procedure ClickElementByValue(iDoc1: IHTMLDocument2; ClickThis: string);
var
  WebForm: IHTMLFormElement;
  FormElements: olevariant;
  I: integer;
begin
  if Assigned(iDoc1) then
  begin
    WebForm := iDoc1.Forms.Item(0, '') as IHTMLFormElement;
    FormElements := WebForm.Elements;
    // Search for element with value
    for I := 0 to FormElements.Length - 1 do
    begin
      if FormElements.Item(I).Value = ClickThis then
        // click on that element
        FormElements.Item(I).Click;
    end;
  end;
end;


At the end of the WB.DocumentComplete event I have the following to do the login:

Code: [Select]
  if Pos('portal.asp?', URL) > 0 then
    with WB.ComServer do
    begin
      SetFieldValue(GetFormByNumber(document as IHTMLDocument2, 0), 'userid', 'me');
      SetFieldValue(GetFormByNumber(document as IHTMLDocument2, 0), 'pwd', 'password');
      ClickElementByValue(document as IHTMLDocument2, 'Login');
    end;
 end;  // end of the DocumentComplete event handler

This is just a start; it all depends on the specific website you need to use...

Mujie

  • Jr. Member
  • **
  • Posts: 64
Re: Scraping data from a web page
« Reply #2 on: September 25, 2013, 05:02:38 am »
Hi @Caravelle

For the login case, I would recommend using Synapse (http://wiki.freepascal.org/Synapse). You need to know which fields the form expects. Example:

Code: [Select]
uses
..., httpsend, synautil, synacode;
....
procedure TfmForm1.Button1Click(Sender: TObject);
var
  URL: string;
  Params: string;
  Response: TMemoryStream;
begin
  // HttpPostURL creates its own THTTPSend internally, so no instance is needed here
  Response := TMemoryStream.Create;
  try
    // POST something to http://localhost/yourwebpage/?action=myactioncase&mydata1=???&whatever=???
    URL := 'http://localhost/yourwebpage/?action=myactioncase';
    Params := 'mydata1=' + EncodeURLElement(Edit1.Text) + '&' +
              'whatever=' + EncodeURLElement(Edit2.Text);
    if HttpPostURL(URL, Params, Response) then
    begin
      // Get the result and save it to a file
      Response.Position := 0;
      Response.SaveToFile('yourwebresult.txt');
    end
    else
      MessageBox(0, PChar('Cannot access the site'), PChar('Warning'), MB_OK);
  finally
    Response.Free;
  end;
end;

And on the PHP side, write:

Code: [Select]
<?php
  $action = $_GET['action'];

  switch ($action)
  {
    case "myactioncase":
      echo "I try to post some ".$_POST['mydata1']." with another ".$_POST['whatever'];
      break;

    default:
      die();
      break;
  }
?>


To check whether the login succeeded, have your PHP code return a known value (a boolean or an integer, say) and parse it in your Lazarus/Pascal code.
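To make that concrete, here is a minimal Pascal sketch of such a check. The 'OK' marker is purely an invented convention for illustration; match whatever your PHP script actually echoes on success:

```pascal
uses SysUtils;

// Returns True when the server's reply begins with the agreed marker.
// 'OK' is an invented convention; match whatever your PHP script echoes.
function LoginSucceeded(const ResponseText: string): Boolean;
begin
  Result := Pos('OK', Trim(ResponseText)) = 1;
end;
```

After HttpPostURL fills the stream, load it into a string (for example via a TStringList) and pass it to this function.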

BeniBela

  • Hero Member
  • *****
  • Posts: 959
    • homepage
Re: Scraping data from a web page
« Reply #3 on: September 25, 2013, 12:51:55 pm »
This kind of thing is easiest with my Internet Tools (as long as no JavaScript is involved).

With them, dcminus's example can be written as:

Code: [Select]
uses simpleinternet;
httpRequest(process('...portal.asp?', 'form((//form)[1], {"userid": "me", "pwd": "password"})'))

And Mujie's example as:

Code: [Select]
uses bbutils, simpleinternet, internetaccess;
strSaveToFileUTF8('yourwebresult.txt',
  httpRequest('http://localhost/yourwebpage/?action=myactioncase',
      'mydata1='+ TInternetAccess.urlEncodeData(Edit1.text) + '&' +
      'whatever='+ TInternetAccess.urlEncodeData(Edit2.text)));


Best is not to save the file, but to extract everything directly. That usually needs only a single line.

Caravelle

  • Jr. Member
  • **
  • Posts: 52
Re: Scraping data from a web page
« Reply #4 on: September 25, 2013, 11:22:27 pm »
Many thanks for all three replies, which I shall print out and read carefully.  Despite working with computers since Sinclair ZX81 days, I still find printed text on paper easier to understand.  I'll try out the suggestions and let you know how I get on.

Just a quick response to:
Quote
Best is, not to save the file, and extract everything directly. Usually only needs  a single line.

My B4A program does it that way, because that's the way the relevant B4A parsing components work.  I did query it in the B4A Forum at the time.  Actually, to start with at least (while developing the parsing routine) it will be useful to be able to examine the downloaded HTML, which is rather complicated, so a saved file could be quite useful.  The parsing routine is actually quite complicated; the bits I want to "read" are quite hard to pull out.

Thanks again to all

Caravelle

Caravelle

  • Jr. Member
  • **
  • Posts: 52
Re: Scraping data from a web page
« Reply #5 on: October 23, 2013, 11:57:32 pm »
Thanks in particular to Benibela, I have managed to use InternetTools to get the desired webpage into a string, like this:
Code: [Select]
P := httpRequest('http://www.planespotters.net/search.php?q=' + Edit1.Text);
It is much easier to construct the URL with the search term in code than to attempt to fill in input controls that have the same result.

I then extract the data I want from the page with a series of fairly straightforward GetPart() text extractions.  It all works, and surprisingly quickly.
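For anyone following along: a helper of that kind can be very small. This is a generic between-two-markers sketch, not the actual GetPart() from any particular library:

```pascal
uses SysUtils;

// Returns the text between the first occurrence of AStart and the
// next occurrence of AStop; '' if either marker is missing.
// A generic sketch, not the actual GetPart() used above.
function Between(const AStart, AStop, ASource: string): string;
var
  P1, P2: Integer;
begin
  Result := '';
  P1 := Pos(AStart, ASource);
  if P1 = 0 then Exit;
  P1 := P1 + Length(AStart);
  P2 := Pos(AStop, Copy(ASource, P1, MaxInt));
  if P2 = 0 then Exit;
  Result := Copy(ASource, P1, P2 - 1);
end;
```

Chaining a few of these calls over the downloaded page is usually enough for simple scrapes.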

However, I am struggling to log in to the site.  Fortunately I only need to log in if I exceed a certain number of lookups, which is OK for testing.  Benibela provided this code:
Code: [Select]
httpRequest(process('...portal.asp?', 'form((//form)[1], {"userid": "me", "pwd": "password"})'))
I copied and pasted this into a new ButtonClick procedure and changed the URL, "userid" to "username", and "pwd" to "password", these being the element names that worked in the Basic4Android version of my program (that one sends JavaScript to the login page).  And obviously I changed "me" and "password" to my actual username and password, shown as @ symbols below.  This resulted in:
Code: [Select]
httpRequest(process('http://www.planespotters.net/login.php', 'form((//form)[1], {"username": "@@@@", "password": "@@@@"})'));
However, on clicking the button I got the following error message (with my username and password changed again):

Quote
Debugger Exception Notification
Project project1 raised exception class 'EXQParsingException' with message:
err:XPST003: Unexpected{,(Enable json extension, to create a json like object)
in form((//form){1],{[<- error occurs before here] "username"; "@@@@", "password": "@@@@"})

In file '.data\xquery_parse.inc' at line 90:
RaiseEXQParsingException.Create(errcode,s+#13#10'in: + 'strslice(@str[1],pos-1)+'[<- error occurs before here]'+strslice(pos,@str[length(str)]));

If anyone knows how to copy the contents of one of these notifications to the clipboard, please let me know; typing the above took forever and may not be perfectly accurate!  Needless to say, I don't have a clue what it is about.  So, is there an error in the quoted code?

Also I do not understand how this code can trigger the login button.  If it helps, the login page is http://www.planespotters.net/login.php and you can of course use your browser to view the code. 

I would be most grateful for a solution.  Many thanks.

Caravelle
« Last Edit: October 24, 2013, 12:00:38 am by Caravelle »

Timewarp

  • Full Member
  • ***
  • Posts: 144
Re: Scraping data from a web page
« Reply #6 on: October 24, 2013, 11:24:23 am »
If it's Windows-only, this should work. It's simple, doesn't need extra components, and works in Delphi too.
I've been using something similar for website logins for many years.

Code: [Select]
uses Windows, Variants, ComObj; // Windows provides GetTickCount and DWord

procedure TForm1.Button1Click(Sender: TObject);
const timeout = 10000;
var httpreq, URL, PostData: OleVariant;
    data: string;
    i: integer;
    startedtime: DWord;
begin
  URL:='http://www.planespotters.net/login.php';
  data:='username=YOURUSERNAME&password=YOURPASSWORD&login=1';

  PostData:=VarArrayCreate([0, Length(data)-1], varByte);
  for i:=1 to Length(data) do
    PostData[i-1]:=Ord(data[i]);
  httpreq:=CreateOleObject('MSXML2.XMLHTTP.6.0');
  startedTime:=GetTickCount;
  httpreq.open('POST', URL, true);
  httpreq.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
  httpreq.send(PostData);
  while (httpreq.ReadyState<>4) and (GetTickCount<startedtime+timeout) do
  begin
    application.processmessages;
    sleep(10);
  end;
  if (GetTickCount>=startedtime+timeout) then
  begin
    httpreq.Abort;
    showmessage('Timeout');
    exit;
  end;
  showmessage(httpreq.ResponseText);
end; 

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Scraping data from a web page
« Reply #7 on: October 24, 2013, 04:53:36 pm »
My guess is that the error is a simple typo. It's this curly bracket, marked here:

Unexpected{ ...... in form((//form){1]

Simply replace the curly bracket { in {1] with the expected square bracket [.

BeniBela

  • Hero Member
  • *****
  • Posts: 959
    • homepage
Re: Scraping data from a web page
« Reply #8 on: October 24, 2013, 06:45:09 pm »
Quote
Debugger Exception Notification
Project project1 raised exception class 'EXQParsingException' with message:
err:XPST003: Unexpected{,(Enable json extension, to create a json like object)
in form((//form){1],{[<- error occurs before here] "username"; "@@@@", "password": "@@@@"})

In file '.data\xquery_parse.inc' at line 90:
RaiseEXQParsingException.Create(errcode,s+#13#10'in: + 'strslice(@str[1],pos-1)+'[<- error occurs before here]'+strslice(pos,@str[length(str)]));

If anyone knows how to copy the contents of one of these notifications to the clipboard, please let me know, typing the above took forever and may not be perfectly accurate !  Needless to say, I don't have a clue what it is about.  So, is there an error in the quoted code?

I forgot to mention that the JSON extension needs to be enabled (I changed that recently; in older versions it was not necessary).

Just write:

Code: [Select]
uses xquery_json;

It sets a global option in its initialization section.

Quote
Also I do not understand how this code can trigger the login button.  If it helps, the login page is http://www.planespotters.net/login.php and you can of course use your browser to view the code.

(//form)[1] finds the first form.

form((//form)[1]) returns an object describing the HTTP request that is sent when the form is submitted.

form((//form)[1], {"username": "@@@@", "password": "@@@@"}) returns an object describing the HTTP request that is sent when the form is submitted after the fields named username and password have been set accordingly.



Quote
Best is, not to save the file, and extract everything directly. Usually only needs  a single line.

My B4A program does it that way, because that's the way the relevant B4A parsing components work.  I did query it in the B4A Forum at the time.  Actually, to start with at least while developing the parsing routine- it will be useful to be able to examine the downloaded html (which is rather complicated) so a saved file could be quite useful.  The parse routine is actually quite complicated, the bits I want to "read" are quite hard to pull out.


And now I remember how I handle this in my program.

I use a global logger which, when enabled, saves all downloaded HTML pages, so when something does not work I can check it there.
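For what it's worth, such a logger can be sketched in a few lines; all names here are made up, and Internet Tools wires its own logging differently:

```pascal
uses SysUtils, Classes;

var
  LogPages: Boolean = False;   // global switch: enable while debugging
  LogCount: Integer = 0;

// When logging is on, writes each downloaded page to a numbered file
// so it can be inspected later when an extraction misbehaves.
procedure LogPage(const Html: string);
var
  SL: TStringList;
begin
  if not LogPages then Exit;
  Inc(LogCount);
  SL := TStringList.Create;
  try
    SL.Text := Html;
    SL.SaveToFile(Format('page_%.3d.html', [LogCount]));
  finally
    SL.Free;
  end;
end;
```

Call LogPage right after every download while developing, and flip LogPages off for normal runs.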
« Last Edit: October 24, 2013, 06:48:03 pm by BeniBela »

Caravelle

  • Jr. Member
  • **
  • Posts: 52
Re: Scraping data from a web page
« Reply #9 on: October 24, 2013, 11:36:37 pm »
Many thanks once again.

Timewarp
Yes, we are talking Windows for this Lazarus program.  I'll save that code away, thanks.  But as I'm already using Internet Tools to get the web-page content, and BeniBela has provided the fix that makes his code work, I'll use that.

engkin
Thanks, I did suspect the curly braces, but changing them made no difference, just caused different error messages.  Apparently they are correct.

benibela
Code: [Select]
uses xquery_json; 
Yes, that is indeed the answer.  It just works now. Instantly.  If I log in programmatically then go to the website using my browser I can see that I am flagged as "logged in".  So many thanks for that.

Thanks also for the explanations.  I won't pretend to understand them; my experience is mainly in database work, and the technical aspects of web programming are a mystery to me, though I have managed to put up a website, mainly in PHP.

Next job, testing to see what happens when there is a problem connecting to the site. 

Thanks again

Caravelle

magleft

  • Full Member
  • ***
  • Posts: 125
Re: Scraping data from a web page
« Reply #10 on: December 11, 2013, 03:06:54 pm »
Hi to all.
I am a novice user. I would like to ask how I can find the names of the fields needed to connect to a website from an application I created with Lazarus?
Thanks
windows 10 64

Leledumbo

  • Hero Member
  • *****
  • Posts: 8836
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: Scraping data from a web page
« Reply #11 on: December 12, 2013, 12:17:59 am »
Quote
how can I find the names of the fields in order to connect to a website with an application that I created with lazarus?
See the website source or its API documentation.
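A rough way to do that from Lazarus itself is to scan the downloaded HTML for the name attributes of its input tags. This is a crude regex sketch (a regex is no substitute for a real HTML parser), using FPC's RegExpr unit:

```pascal
uses Classes, RegExpr;

// Collects the name="..." attributes of <input> tags from a chunk of HTML.
// Crude but enough to discover which fields a login form expects;
// the caller owns (and must free) the returned list.
function InputNames(const Html: string): TStringList;
var
  Re: TRegExpr;
begin
  Result := TStringList.Create;
  Re := TRegExpr.Create;
  try
    Re.Expression := '<input[^>]*name="([^"]+)"';
    Re.ModifierI := True;
    if Re.Exec(Html) then
      repeat
        Result.Add(Re.Match[1]);
      until not Re.ExecNext;
  finally
    Re.Free;
  end;
end;
```

Alternatively, just open the login page in a browser, view the source, and look for the form element and its input name="..." attributes.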

 
