Author Topic: HTML files get values  (Read 6049 times)

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #15 on: May 29, 2020, 05:32:36 pm »
Could you please post an example?

regards
Alaa

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 796
Re: HTML files get values
« Reply #16 on: May 30, 2020, 06:41:01 am »
Hello,
with the internettools package (available in the Online Package Manager) you can use XPath to find elements in HTML.
Here is a simple project that extracts the last page number from your HTML:
Code: Pascal
program xPathTest;
uses Classes, SysUtils, xquery, simpleinternet;
var
  ListValue: IXQValue;
  PageXPath: String;
  extFile: TStringList;
  linkList: TStringList;
  NextIdx: Integer;
begin
  extFile := TStringList.Create();
  linkList := TStringList.Create();
  try
    // read the HTML file
    extFile.LoadFromFile('M:\test\Zebra_Page.html');
    // select every <li> of the pagination block that contains a link
    PageXPath := '//div[@class="Zebra_Pagination"]//li[a]';
    // collect the text of each matching element
    for ListValue in process(extFile.Text, PageXPath) do
      linkList.Add(ListValue.toString);
    // the last page number is the entry just before "Next"
    NextIdx := linkList.IndexOf('Next');
    if NextIdx > 0 then
      Writeln('Last Page : ' + linkList[NextIdx - 1])
    else
      Writeln('No page entry found before "Next"');
  finally
    linkList.Free;
    extFile.Free;
  end;
end.



1 - Read the HTML file.
2 - The XPath expression in PageXPath selects all the matching elements.
3 - Put the text value of each element into a string list.
4 - The last page number is the entry just before the "Next" entry in the list.


Friendly, J.P
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #17 on: May 31, 2020, 06:58:05 pm »
Thanks JP. I tried to install internettools, but it showed me this error:
"one or more package is required"
and I don't know which package is required.

regards

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #18 on: May 31, 2020, 08:52:46 pm »
Code: HTML5
<div id="alternatives" class="mt-2" style="text-align: left;">
<div class="text-center">
<a class="btn btn-success btn-responsive" href="http://www.davekuhlman.org/python_book_01.pdf" target="_blank" rel="nofollow" onclick="c();">Go to PDF</a>
</div>
How can I get the link to the PDF file from this?

And what is the best way to understand and learn TRegExpr?
Please help.
Thanks,
Alaa

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #19 on: May 31, 2020, 10:04:29 pm »
This is the difficult part of my program. I overcame all the previous challenges with your support (thanks to everyone who helped me), but now I have hit a closed door: the page I want to scrape shows a loader bar for some time, and only then loads the HTML that contains the link to the book I want.

How can I overcome this?

thanks
Alaa

rvk

  • Hero Member
  • *****
  • Posts: 4143
Re: HTML files get values
« Reply #20 on: May 31, 2020, 10:20:51 pm »
Quote
The page I want to scrape shows a loader bar for some time, and only then loads the HTML that contains the link to the book I want. How can I overcome this?

That depends. If the loader bar is driven by JavaScript (or similar) and the HTML is only loaded once that script has run, you can't get the page with a plain gethtml() call.

So first check what text your gethtml() function actually returns.

If that text contains nothing you want to scrape, you will need to execute the JavaScript. You may be able to use the IE engine to fetch the page, since it does execute JavaScript.

For example, you could use Browser := CreateOleObject('InternetExplorer.Application'); to start an Internet Explorer instance, call Navigate to go to the correct page, and then read out the HTML.
(That only works on Windows, though.)
If there are any security measures (such as CAPTCHAs), you will still need to do some things manually.
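
The first check could look roughly like the sketch below. It is only an illustration: PageUrl is a placeholder, the marker text 'Go to PDF' is taken from the snippet in reply #18, and the opensslsockets unit assumes FPC 3.2 or later (that unit does not exist in FPC 3.0.x).

```pascal
program CheckFetchedHtml;
{$mode objfpc}{$H+}
uses
  SysUtils,
  fphttpclient,
  opensslsockets; // HTTPS support; assumes FPC 3.2+ and installed OpenSSL libraries

const
  // Placeholder values: use the real page and a text you expect in the loaded page
  PageUrl = 'https://example.com/book-page';
  Marker  = 'Go to PDF';

var
  Client: TFPHTTPClient;
  Html: string;
begin
  Client := TFPHTTPClient.Create(nil);
  try
    Client.AllowRedirect := True;
    try
      // Fetch the raw HTML exactly as the server sends it (no JavaScript is run)
      Html := Client.Get(PageUrl);
    except
      on E: Exception do
      begin
        Writeln('Download failed: ', E.Message);
        Halt(1);
      end;
    end;
  finally
    Client.Free;
  end;

  if Pos(Marker, Html) > 0 then
    Writeln('Marker found: the link is in the static HTML.')
  else
    Writeln('Marker missing: the content is probably added later by JavaScript.');
end.
```

If the marker is missing here but visible in a normal browser, that points to the JavaScript case described above.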

trev

  • Hero Member
  • *****
  • Posts: 714
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #21 on: June 01, 2020, 03:34:17 am »
What web site are you trying to scrape?
o Lazarus v2.1.0 r63272, FPC v3.3.1 r45525, macOS 10.14.6 (with sup update), Xcode 11.3.1
o Lazarus v2.1.0 r61574, FPC v3.3.1 r42318, FreeBSD 12.1 amd64 (Parallels VM)
o FPC 3.0.4, FreeBSD 12-STABLE r361007 amd64
o Lazarus v2.1.0 r61574, FPC v3.0.4, Ubuntu 18.04 (Parallels VM)

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 796
Re: HTML files get values
« Reply #22 on: June 01, 2020, 07:10:34 am »
Hello,
Quote
Thanks JP. I tried to install internettools, but it showed me this error: "one or more package is required", and I don't know which package is required.

What is your OS, and which Lazarus version are you using?
I have tried internettools on Windows 10 and CentOS 8 with Lazarus 2.0.8. It works if you do the following in your project:
 1 - Menu Project/Project Inspector -> in that window, add internettools and internettools_utf8 to Required Packages.
 2 - Menu Project/Project Options -> under Compiler Options/Paths, in Other unit files (-Fu), add the paths of these folders of the internettools package:
path_to_onlinepackagemanager/packages/internettools-master/data
path_to_onlinepackagemanager/packages/internettools-master/system
path_to_onlinepackagemanager/packages/internettools-master/flre-master/src

With internettools you can also download files. Example:
Code: Pascal
implementation
uses xquery, simpleinternet, bbutils, internetaccess;
{$R *.lfm}
{ TForm1 }
procedure TForm1.Bt_DownloadClick(Sender: TObject);
var TargetFile, SiteURL, Download: string;
begin
  SiteURL := 'https://www.coursef.com/more-reviews/www.davekuhlman.org?q=python%203%20tutorial%20pdf%20download';
  TargetFile := '/home/jurassic/mybook.pdf';
  // set user agent (fails without it)
  defaultInternetConfiguration.userAgent := 'curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18';
  Download := retrieve(SiteURL);
  // if we received an HTML page instead of the file, follow the link it contains
  if strBeginsWith(LowerCase(Download), '<!doctype html>') then begin
    // download page: the real file URL is in the data-full-link attribute
    SiteURL := process(Download, '//@data-full-link').toString;
    Download := retrieve(SiteURL);
    if strBeginsWith(LowerCase(Download), '<!doctype html>') then raise Exception.create('Multiple redirections');
  end;
  strSaveToFileUTF8(TargetFile, Download);
end;
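
The same process() call should also be able to pull the PDF link out of the snippet posted in reply #18. A minimal sketch, assuming the snippet is available as a string (here it is pasted into a constant, with the outer div closed for the example) and that the link is the only a element inside the div with id "alternatives":

```pascal
program ExtractPdfLink;
{$mode objfpc}{$H+}
uses
  SysUtils, xquery, simpleinternet;

const
  // The HTML from reply #18, pasted as a string for this example
  Html =
    '<div id="alternatives" class="mt-2" style="text-align: left;">' +
    '<div class="text-center">' +
    '<a class="btn btn-success btn-responsive" ' +
    'href="http://www.davekuhlman.org/python_book_01.pdf" ' +
    'target="_blank" rel="nofollow" onclick="c();">Go to PDF</a>' +
    '</div></div>';

var
  Link: string;
begin
  // Select the href attribute of the link inside the "alternatives" block
  Link := process(Html, '//div[@id="alternatives"]//a/@href').toString;
  Writeln('PDF link: ', Link);
end.
```

In a real program the Html constant would be replaced by the text returned from retrieve() or another download call.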

Friendly, J.P

« Last Edit: June 01, 2020, 10:09:39 am by Jurassic Pork »

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #23 on: June 01, 2020, 06:33:26 pm »
I am using TFPHTTPClient.SimpleGet(url); there is no gethtml method in it. I put the result in Memo1, but the text did not contain what I wanted.
I tried to use the HtmlViewer tool to load the URL and get the text from it, but I did not know how.
If you could help, I would be thankful.


Regards
Alaa

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #24 on: June 01, 2020, 06:51:30 pm »
Hi JP,
I am using Lazarus 2.0.8 64-bit on Windows 10.
I tried your instructions; please check the attachment.

thanks

rvk

  • Hero Member
  • *****
  • Posts: 4143
Re: HTML files get values
« Reply #25 on: June 01, 2020, 11:04:35 pm »
Quote
I am using TFPHTTPClient.SimpleGet(url); there is no gethtml method in it. I put the result in Memo1, but the text did not contain what I wanted.
I tried to use the HtmlViewer tool to load the URL and get the text from it, but I did not know how.

I did mean simpleget rather than gethtml: I meant any function that gets the HTML (so Get or SimpleGet).

If you don't get what you want, the JavaScript probably needs to be executed. I don't think HtmlViewer does that. An instance of Internet Explorer does.

Are you on Windows? If so, which site are you trying to scrape? We can help you better if we can try some code ourselves.

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 796
Re: HTML files get values
« Reply #26 on: June 02, 2020, 03:14:22 am »
Hello,
Quote
I am using Lazarus 2.0.8 64-bit on Windows 10.
I tried your instructions; please check the attachment.

My Lazarus is the 32-bit version on Windows.
To see which dependency is broken, open the package graph (menu Package/Package Graph). It may be laz_synapse.

Friendly J.P

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #27 on: June 02, 2020, 08:05:58 pm »
Quote
I did mean simpleget instead of gethtml. I meant a function to get the html (so get or simpleget).

If you don't get what you want, probably the javascript needs to be executed. I don't think htmlviewer does that. An instance of internet explorer does.

Are you on Windows and if so, what site are you trying because we can help you better if we try some code ourselves.
I added an instance of Internet Explorer and loaded the page, but it seems JavaScript is not allowed to run. How can I make it run?
Code: Pascal
url:=UTF8Decode(x3);
onull:=NULL;

Browser.ComServer.Navigate2(url,onull,onull,onull,onull);
And if this works, how do I then get the page source into a memo?
thanks
Alaa

rvk

  • Hero Member
  • *****
  • Posts: 4143
Re: HTML files get values
« Reply #28 on: June 02, 2020, 09:29:01 pm »
Quote
I added an instance of Internet Explorer and loaded the page, but it seems JavaScript is not allowed to run. How can I make it run?

It does allow JavaScript, but the embedded version cannot handle jQuery (because it emulates IE6). So you probably shouldn't use the embedded version of Internet Explorer, because IE6 is very old. You can raise the emulated version in the registry, but that might be a hassle.

It is better to just use a new OLE object (with the full, latest IE version).

Something like this:

Code: Pascal
uses
  comobj,   // CreateOleObject, for creating the Browser object
  ActiveX,  // CoInitialize / CoUninitialize
  Variants; // Unassigned

procedure TForm1.Button1Click(Sender: TObject);
const
  GoUrl = 'http://www.yourwebsite.com';
var
  Browser: olevariant;
  Document: olevariant;
  Body: olevariant;
begin
  CoInitialize(nil);
  try
    try
      Browser := CreateOleObject('InternetExplorer.Application');

      Browser.AddressBar := false;
      Browser.Menubar := false;
      Browser.ToolBar := false;
      Browser.StatusBar := false;
      Browser.Width := 600;
      Browser.Height := 600;
      Browser.Left := Screen.Width div 2 - Browser.Width div 2;
      Browser.Top := Screen.Height div 2 - Browser.Height div 2;
      Browser.Visible := True;

      Browser.Navigate(GoUrl);

      // wait until the browser reports ready state 4 (complete)
      while (Browser.ReadyState < 4) do
      begin
        Sleep(500);
        Application.ProcessMessages;
      end;

      Document := Browser.Document;
      Body := Document.Body;
      Memo1.Lines.Add(Body.InnerHTML);

      Browser.Quit;
    except
      // show the error instead of silently swallowing it
      on E: Exception do Memo1.Lines.Add('Error: ' + E.Message);
    end;
  finally
    Browser := Unassigned;
    CoUninitialize;
  end;
end;

I'm not sure whether the while loop is enough to let the JavaScript finish running. If it isn't, you'll need to build in a delay.

You could even keep the window hidden (comment out Browser.Visible := True;).
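
One way to build in that delay is to poll the page until the expected content shows up, with a timeout so the loop cannot hang forever. The unit below is only a sketch under a few assumptions: the unit name BrowserWait and the function WaitForMarker are made up for this example, Browser is the OLE object created as in the code above (so Windows only), and Marker is some text you know appears once the page has finished loading, such as 'Go to PDF'.

```pascal
unit BrowserWait;
{$mode objfpc}{$H+}

interface

// Hypothetical helper: returns True as soon as Marker appears in the page body,
// or False if TimeoutMs milliseconds pass without it showing up.
function WaitForMarker(Browser: OleVariant; const Marker: string;
  TimeoutMs: QWord): Boolean;

implementation

uses
  SysUtils, Forms, ComObj, Variants;

function WaitForMarker(Browser: OleVariant; const Marker: string;
  TimeoutMs: QWord): Boolean;
var
  Deadline: QWord;
  Html: string;
begin
  Result := False;
  Deadline := GetTickCount64 + TimeoutMs;
  while GetTickCount64 < Deadline do
  begin
    // Only inspect the document once IE reports ready state 4 (complete)
    if Browser.ReadyState = 4 then
    begin
      // Read the live DOM, which includes whatever the scripts have added so far
      Html := Browser.Document.Body.InnerHTML;
      if Pos(Marker, Html) > 0 then
        Exit(True);
    end;
    Sleep(250);
    Application.ProcessMessages; // keep the form responsive while waiting
  end;
end;

end.
```

In the button handler it could replace the simple while loop, for example: if WaitForMarker(Browser, 'Go to PDF', 15000) then Memo1.Lines.Add(Browser.Document.Body.InnerHTML);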

alaa123456789

  • Jr. Member
  • **
  • Posts: 67
Re: HTML files get values
« Reply #29 on: June 03, 2020, 07:57:36 pm »
Quote
It is better to just use a new oleobject (with the full latest IE version).

Hi rvk, thanks for your advice. I applied it in my program and got the result shown in the attached capture.

I tried with CoInitialize and without it: same result.

Please advise.

regards
Alaa

 
