Author Topic: HTML files get values  (Read 6229 times)

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #30 on: June 03, 2020, 08:16:29 pm »
 :D :D :D :D Thank you so much, rvk, I finally got the link to the book.

 
Code: Pascal
// Needs ComObj, ActiveX, Variants and RegExpr in the uses clause.
procedure TForm1.ListBox1Click(Sender: TObject);
var
  x1, x2: string;
  x3: Variant;
  page1: AnsiString;
  re2: TRegExpr;
  Browser: OleVariant;
  Document: OleVariant;
  Body: OleVariant;
begin
  x1 := ListBox1.Items.Strings[ListBox1.ItemIndex];
  x2 := StringReplace(x1, '-e', '-d', [rfIgnoreCase, rfReplaceAll]);
  x3 := UTF8Encode('yourwebsite' + x2);
  // url := UTF8Decode(x3);
  CoInitialize(nil);
  Browser := CreateOleObject('InternetExplorer.Application');
  try
    try
      Browser.AddressBar := false;
      Browser.Menubar := false;
      Browser.ToolBar := false;
      Browser.StatusBar := false;
      Browser.Width := 600;
      Browser.Height := 600;
      Browser.Left := Screen.Width div 2 - Browser.Width div 2;
      Browser.Top := Screen.Height div 2 - Browser.Height div 2;
      Browser.Visible := true;

      Browser.Navigate(x3);

      // ReadyState 4 means the page has finished loading
      while (Browser.ReadyState < 4) do
      begin
        Sleep(20000); // pauses 20 seconds between checks
        Application.ProcessMessages;
      end;

      Document := Browser.Document;
      Body := Document.Body;
      Memo1.Lines.Add(Body.InnerHTML);

      Browser.Quit;
    except
      on E: Exception do ; // eat exception
    end;
  finally
    Browser := Unassigned;
    CoUninitialize;
  end;
end;

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #31 on: June 03, 2020, 08:21:59 pm »
Quote
I applied it in my program and got the attached capture.
I tried with CoInitialize and without, same result.
On what OS version are you trying this?
What version of Internet Explorer do you have installed?

For example, IE6 doesn't support jQuery, and you might get that error.

If you run Internet Explorer (so NOT Chrome or other browser) and go to that site... Do you get the same error?
 


Ah. It worked... Good.

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #32 on: June 05, 2020, 07:49:35 am »
Hi all,
I found another type of link, like this:

/download.pdf?id=158527426&amp;h=2a2e7156d5eb07e0bb5d263b666d9052&amp;u=cache&amp;ext=pdf

I tried to download it with SimpleGet, but it didn't work.

Could you please advise?

Regards
Alaa

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #33 on: June 05, 2020, 09:38:13 am »
Quote
I tried to download it with SimpleGet, but it didn't work.
"It didn't work" isn't really useful information, and without the actual URL itself we can't know what's going on.

What didn't work?
Does the link work in Internet Explorer?
Do you need to confirm something on the webpage?


alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #34 on: June 05, 2020, 10:00:33 am »
Quote
Does the link work in Internet Explorer?
Yes, when you open it with Chrome or Firefox, it shows a save/open dialog to download it.
Quote
Do you need to confirm something on the webpage?
No. When I download it with the SimpleGet method, I get only 1 KB of it.
I think the link I shared before redirects to another link.

Code: HTML5
<a type="button" class="btn btn-primary btn-user" href="/download.pdf?id=158527426&amp;h=2a2e7156d5eb07e0bb5d263b666d9052&amp;u=cache&amp;ext=pdf" target="_blank" onclick="ga('send', 'event', 'HiddenMenuDownload', 'DownloadORJPDF');" style="border-top-left-radius: 3px;border-bottom-left-radius: 3px;">
<i class="fas fa-cloud-download-alt" aria-hidden="true" style="margin-right: 9px;margin-left: 2px;font-size: 25px;vertical-align: middle;color: #119802;"></i>Download ( PDF )
</a>

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #35 on: June 05, 2020, 10:06:42 am »
Quote
No. When I download it with the SimpleGet method, I get only 1 KB of it.
It is very possible that download.pdf is just a simple page which executes JavaScript, or is even a redirect page. In that case SimpleGet won't work (on its own).

But it all depends on what's in the 1 KB response you do get from that SimpleGet.
Does it have a Location header?
Does it contain JavaScript?

Again, without full source of that file or complete url it's just guessing from our side.
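
To check this, the status code, the response headers, and the start of the body can be printed. A minimal sketch, assuming FPC's fphttpclient (with opensslsockets for HTTPS); the helper name InspectResponse, the list of accepted status codes, and the 500-character preview are illustrative assumptions:

```pascal
program inspectresponse_demo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, fphttpclient, opensslsockets;

// Hypothetical helper: prints what the server actually sent back.
procedure InspectResponse(const URL: string);
var
  HTTP: TFPHTTPClient;
  Body: TStringStream;
begin
  HTTP := TFPHTTPClient.Create(nil);
  Body := TStringStream.Create('');
  try
    HTTP.AllowRedirect := False; // show the redirect itself instead of following it
    // Accept redirect codes as well, so they do not raise an exception
    HTTP.HTTPMethod('GET', URL, Body, [200, 301, 302]);
    WriteLn('Status  : ', HTTP.ResponseStatusCode);
    WriteLn('Location: ', HTTP.GetHeader(HTTP.ResponseHeaders, 'Location'));
    WriteLn(HTTP.ResponseHeaders.Text);
    // First part of the body, enough to spot a <script> tag or an error page
    WriteLn(Copy(Body.DataString, 1, 500));
  finally
    Body.Free;
    HTTP.Free;
  end;
end;

begin
  if ParamCount >= 1 then
    InspectResponse(ParamStr(1));
end.
```

Run it with the full download URL as the first argument. A non-empty Location line points to a redirect; a script tag in the preview suggests the page relies on JavaScript. Any status code outside the accepted list raises an exception.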

trev

  • Hero Member
  • *****
  • Posts: 718
  • Former Delphi 1-7 and 10.2 User
Re: HTML files get values
« Reply #36 on: June 05, 2020, 10:59:06 am »
It would help if the OP would tell us which website this is - I gave up helping when he didn't.
o Lazarus v2.1.0 r63272, FPC v3.3.1 r45525, macOS 10.14.6 (with sup update), Xcode 11.3.1
o Lazarus v2.1.0 r61574, FPC v3.3.1 r42318, FreeBSD 12.1 amd64 (Parallels VM)
o FPC 3.0.4, FreeBSD 12-STABLE r361007 amd64
o Lazarus v2.1.0 r61574, FPC v3.0.4, Ubuntu 18.04 (Parallels VM)

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #37 on: June 05, 2020, 04:39:45 pm »
https://www.pdfdrive.com/

now please help

Regards
Alaa

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #38 on: June 05, 2020, 05:59:23 pm »
Quote
https://www.pdfdrive.com/
now please help
As I expected, the /download.pdf URL you gave results in a 301 status code, which is a redirect.
If you follow that redirect (with AllowRedirect := true), it still gives a 400 error code (Bad Request).

The server still expects something that it isn't being given.
Perhaps cookies need to be kept across the redirect.
I also tried a different User-Agent, but that didn't work either.

Will try later on.

Code for now:
Code: Pascal
uses fphttpclient, opensslsockets;

procedure TForm1.Button1Click(Sender: TObject);
var
  HTTP: TFPHttpClient;
  Stream: TMemoryStream;
  URL: string;
begin
  HTTP := TFPHttpClient.Create(nil);
  Stream := TMemoryStream.Create;
  try
    HTTP.AllowRedirect := true;
    HTTP.AddHeader('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36');
    URL := 'https://www.pdfdrive.com/download.pdf?id=10172273&h=84f0f3490acb0a861ce0cf97be914eed&u=cache&ext=pdf';
    try
      HTTP.HTTPMethod('GET', URL, Stream, [200, 301, 400]); // 400 shouldn't happen !
      Stream.SaveToFile('c:\temp\test.pdf');

      // HTTP.Get(URL, 'c:\temp\test.pdf'); // gives a 301 exception

      Memo1.Lines.Add(HTTP.ResponseHeaders.Text);
    except
      on E: Exception do
        Memo1.Lines.Add(HTTP.ResponseHeaders.Text);
    end;
  finally
    HTTP.Free;
    Stream.Free;
  end;

end;

wget on Linux works fine on the link, so it shouldn't be too difficult.
« Last Edit: June 05, 2020, 06:01:19 pm by rvk »

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #39 on: June 05, 2020, 06:21:28 pm »
OK, at first I thought the cookies might not be saved during the redirect. But it turns out the cookies don't matter.

BUT... On the first redirect you get a URL with a parameter for the filename. That filename contains spaces, and if you use that URL as-is you get a 400 error (Bad Request). It turns out you need to encode the filename (replace the spaces with %20). Wget on Linux, Chrome, IE, etc. do this automatically.

So the redirect handling in fphttpclient is somewhat flawed, in that there is no callback that lets us change the redirect URL. We therefore need to catch the 301 ourselves and encode the redirect URL correctly.

Something like this (worked for me):

Code: Pascal
uses fphttpclient, opensslsockets;

procedure TForm1.Button1Click(Sender: TObject);
var
  HTTP: TFPHttpClient;
  Stream: TMemoryStream;
  URL: string;
begin
  URL := 'https://www.pdfdrive.com/download.pdf?id=10172273&h=84f0f3490acb0a861ce0cf97be914eed&u=cache&ext=pdf';
  HTTP := TFPHttpClient.Create(nil);
  Stream := TMemoryStream.Create;
  try
    HTTP.AllowRedirect := false; // handle the redirect ourselves
    HTTP.AddHeader('User-Agent', 'Wget/1.20.1 (linux-gnu)');
    HTTP.HTTPMethod('GET', URL, Stream, [200, 301]);
    if HTTP.ResponseStatusCode = 301 then
    begin
      URL := HTTP.GetHeader(HTTP.ResponseHeaders, 'Location');
      URL := StringReplace(URL, ' ', '%20', [rfReplaceAll]); // IMPORTANT
      Stream.Clear; // discard any body sent with the 301 response
      HTTP.HTTPMethod('GET', URL, Stream, [200]);
    end;
    Stream.SaveToFile('c:\temp\test.pdf');
  finally
    HTTP.Free;
    Stream.Free;
  end;
end;
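
For comparison, the fcl-web httpget example that ships with FPC uses an OnRedirect event on TFPHTTPClient. If your FPC version has it, the destination URL could be rewritten there while AllowRedirect stays true. A sketch under that assumption, not tested against this site; the class TRedirectFixer and the output file name test.pdf are hypothetical:

```pascal
program redirectfix;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, fphttpclient, opensslsockets;

type
  // Hypothetical holder class: the event needs a method of an object.
  TRedirectFixer = class
    procedure FixRedirect(Sender: TObject; const ASrc: String; var ADest: String);
  end;

procedure TRedirectFixer.FixRedirect(Sender: TObject; const ASrc: String; var ADest: String);
begin
  // Percent-encode the spaces left in the Location URL
  ADest := StringReplace(ADest, ' ', '%20', [rfReplaceAll]);
end;

var
  HTTP: TFPHTTPClient;
  Fixer: TRedirectFixer;
begin
  HTTP := TFPHTTPClient.Create(nil);
  Fixer := TRedirectFixer.Create;
  try
    HTTP.AllowRedirect := True;
    HTTP.OnRedirect := @Fixer.FixRedirect; // assumption: event exists in your FPC version
    HTTP.AddHeader('User-Agent', 'Wget/1.20.1 (linux-gnu)');
    HTTP.Get('https://www.pdfdrive.com/download.pdf?id=10172273&h=84f0f3490acb0a861ce0cf97be914eed&u=cache&ext=pdf',
             'test.pdf');
  finally
    Fixer.Free;
    HTTP.Free;
  end;
end.
```

Catching the 301 by hand, as in the procedure above, has the advantage of not depending on that event at all.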

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #40 on: June 05, 2020, 06:25:55 pm »
Does this work on Windows?

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #41 on: June 05, 2020, 06:32:43 pm »
Quote
Does this work on Windows?
The procedure I showed works on Windows, yes (I made it on Windows).

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #42 on: June 05, 2020, 07:32:45 pm »
Thanks rvk, it worked.
Code: Pascal
baseurl := 'https://www.pdfdrive.com/search?q=' + Edit1.Text + '&pagecount=&pubyear=&searchin=&page=';
with TFPHttpClient.Create(nil) do
  try
    page := Get(baseurl); // Find all book urls
  finally
    Free;
  end;

I am trying to let the user type any keyword to search for books,
but I see the website uses different URL formats, as in the captures.

So how do I figure out which link I should use to extract the book names?

thanks

rvk

  • Hero Member
  • *****
  • Posts: 4158
Re: HTML files get values
« Reply #43 on: June 05, 2020, 11:06:48 pm »
Quote
So how do I figure out which link I should use to extract the book names?
It seems that for some search terms, the result is xxxxx-books.html.
But I don't think it matters much.
https://www.pdfdrive.com/search?q=python&pagecount=&pubyear=&searchin=&more=true
gives about the same result as
https://www.pdfdrive.com/python-books.html

And if you do something like this
https://www.pdfdrive.com/somethingelse-books.html
you end up on this url
https://www.pdfdrive.com/search?q=somethingelse&pagecount=&pubyear=&searchin=&more=true

(Just make sure AllowRedirect := true, so that a redirect will be followed if one is given.)
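
A minimal sketch of such a request, assuming fphttpclient with opensslsockets; the helper name FetchPage is hypothetical, and the example URL is one of the search URLs mentioned above:

```pascal
program redirectdemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, fphttpclient, opensslsockets;

// Hypothetical helper: returns the page body, following any redirect.
function FetchPage(const URL: string): string;
var
  HTTP: TFPHTTPClient;
begin
  HTTP := TFPHTTPClient.Create(nil);
  try
    HTTP.AllowRedirect := True; // follow a 301/302 to the final page
    Result := HTTP.Get(URL);
  finally
    HTTP.Free;
  end;
end;

begin
  WriteLn(Length(FetchPage('https://www.pdfdrive.com/python-books.html')), ' bytes received');
end.
```

With this in place, either URL form can be requested and the same parsing code applied to whatever page comes back.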

alaa123456789

  • Jr. Member
  • **
  • Posts: 68
Re: HTML files get values
« Reply #44 on: June 07, 2020, 07:03:47 pm »
hi
Quote
https://www.pdfdrive.com/search?q=somethingelse&pagecount=&pubyear=&searchin=&more=true
I have done this before, but I get an error when searching for more than one word, for example "visual basic" or "visual basic 6".

Code: Pascal
ListBox1.Items.Clear;
baseurl := 'https://www.pdfdrive.com/search?q=' + Edit1.Text + '&pagecount=&pubyear=&searchin=&more=true';
http1 := TFPHttpClient.Create(nil);
with http1 do
  try
    http1.AllowRedirect := true;
    page := http1.SimpleGet(baseurl); // Find all book urls
  finally
    Free;
  end;
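
A likely cause is the unencoded space in the keyword, the same kind of problem as the unencoded filename in Reply #39. A minimal sketch of encoding the search term before building the URL, assuming the EncodeURLElement function declared in the fphttpclient unit; the helper name BuildSearchURL is hypothetical:

```pascal
program searchurl;
{$mode objfpc}{$H+}
uses
  SysUtils, fphttpclient;

// Hypothetical helper: builds the search URL with the keyword percent-encoded.
function BuildSearchURL(const Keyword: string): string;
begin
  Result := 'https://www.pdfdrive.com/search?q=' +
            EncodeURLElement(Trim(Keyword)) +
            '&pagecount=&pubyear=&searchin=&more=true';
end;

begin
  // Print the URL for a multi-word keyword to check that the space is encoded
  WriteLn(BuildSearchURL('visual basic 6'));
end.
```

In the form code, this would mean passing BuildSearchURL(Edit1.Text) to the HTTP client in place of the hand-concatenated string.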

 
