
Author Topic: HTML files get values  (Read 20403 times)

alaa123456789

  • Sr. Member
  • ****
  • Posts: 260
  • Try your Best to learn & help others
    • youtube:
Re: HTML files get values
« Reply #45 on: June 07, 2020, 08:31:43 pm »
Does this work on Windows?
Code: Pascal
procedure TForm1.downloadBook(bookname: String);
var
  HTTP: TFPHttpClient;
  Stream: TMemoryStream;
begin
  HTTP := TFPHttpClient.Create(nil);
  Stream := TMemoryStream.Create;
  try
    // Follow the redirect by hand so the Location header can be fixed up first.
    HTTP.AllowRedirect := false;
    HTTP.AddHeader('User-Agent', 'Wget/1.20.1 (linux-gnu)');
    HTTP.HTTPMethod('GET', bookname, Stream, [200, 301]);
    if HTTP.ResponseStatusCode = 301 then
    begin
      bookname := HTTP.GetHeader(HTTP.ResponseHeaders, 'Location');
      bookname := StringReplace(bookname, ' ', '%20', [rfReplaceAll]); // IMPORTANT
      Stream.Clear; // drop the body of the 301 response before the real download
      HTTP.HTTPMethod('GET', bookname, Stream, [200]);
    end;
    Stream.SaveToFile(targetDirectory + 'test.pdf');
  finally
    HTTP.Free;
    Stream.Free;
  end;
end;
It is not downloading, and I don't know what the issue is.

rvk

  • Hero Member
  • *****
  • Posts: 6169
« Reply #46 on: June 07, 2020, 10:17:56 pm »
hi
Quote
https://www.pdfdrive.com/search?q=somethingelse&pagecount=&pubyear=&searchin=&more=true
I have done this before, but I get an error when searching for more than one word, for example "visual basic" or "visual basic 6".
You need to URL-encode the edit1.text before adding it to the URL.

For example, a value placed in a URL may not contain raw spaces, slashes, ampersands (&), etc.

For now you only mentioned spaces. Replace those with %20 (the same as I did with the redirect URL).

There is a routine in Synapse which does this for you, but I can't remember which.
(I think encodeurl() in the synacode unit.)

Edit: ah, you already found it. But it's best to use encodeurl so other characters like / and & are also correctly encoded.
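
To illustrate the point about encoding the search text, here is a minimal sketch. It is not the Synapse routine: EncodeQueryValue is a hypothetical helper written only for this example, and it percent-encodes every byte outside the RFC 3986 "unreserved" set, so spaces, /, & and % in the search text cannot break the URL. The search URL prefix is the one quoted in this thread. It works byte by byte, so non-ASCII input is assumed to be UTF-8 already. (As far as I recall, synacode also has an EncodeURLElement variant meant for single values such as a form field; that is worth checking in the Synapse sources before relying on it.)

```pascal
program EncodeSearchTerm;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (not a library routine): percent-encode every byte
// that is not an RFC 3986 "unreserved" character.
function EncodeQueryValue(const S: string): string;
const
  Unreserved = ['A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'];
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    if S[i] in Unreserved then
      Result := Result + S[i]
    else
      Result := Result + '%' + IntToHex(Ord(S[i]), 2);
end;

begin
  // Stand-in for Edit1.Text; the space becomes %20 and the & becomes %26.
  WriteLn('https://www.pdfdrive.com/search?q=' + EncodeQueryValue('c & c++ basics'));
end.
```

In the form, the call would simply wrap the edit box text before it is appended to the search URL.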

Quote from: alaa123456789
it is not downloading, I don't know what the issue is
What error are you getting?

Are you passing the complete URL? The first GET is done on bookname alone.
If bookname is only the name of the book or the search term, you need to put it into a complete URL.
« Last Edit: June 07, 2020, 10:23:03 pm by rvk »

alaa123456789

« Reply #47 on: June 08, 2020, 06:43:58 pm »
hi
Quote
You need to URL-encode the edit1.text before adding it to the URL.

For example, a value placed in a URL may not contain raw spaces, slashes, ampersands (&), etc.

For now you only mentioned spaces. Replace those with %20 (the same as I did with the redirect URL).
Please see the attached capture; I got error 400.

Quote
What error are you getting?

Are you passing the complete URL? The first GET is done on bookname alone.
If bookname is only the name of the book or the search term, you need to put it into a complete URL.
I am passing the same URL I shared with you before, i.e.
download.pdf?id=158527426&h=2a2e7156d5eb07e0bb5d263b666d9052&u=cache&ext=pdf
Code: Pascal
x4 := 'https://www.pdfdrive.com/' + re2.Match[1];
downloadBook(x4);
I only get a 1 KB file.
This is what I am facing.

rvk

« Reply #48 on: June 08, 2020, 06:51:52 pm »
Quote
please see attached capture; I got error 400.
In the 1 KB file you'll see a "bad request" message.
That means the server doesn't understand the request.
For me the cause was that spaces were not interpreted correctly.

Try a search without spaces and without special characters.
What does the baseurl look like before the simpleget?
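
One cheap way to confirm what such a small file is: a real PDF starts with the bytes %PDF-, so anything else is most likely the server's error page. A minimal sketch, with LooksLikePdf as a hypothetical helper name (it is not part of fphttpclient); in downloadBook it could be called on the memory stream before SaveToFile. The demo stream content is made up for illustration.

```pascal
program CheckPdfHeader;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: True when the stream starts with the PDF signature.
function LooksLikePdf(Stream: TStream): Boolean;
const
  Signature = '%PDF-';
var
  Buf: AnsiString;
  SavedPos: Int64;
begin
  Result := False;
  if Stream.Size < Length(Signature) then
    Exit;
  SavedPos := Stream.Position;
  try
    Stream.Position := 0;
    SetLength(Buf, Length(Signature));
    Stream.ReadBuffer(Buf[1], Length(Buf));
    Result := Buf = Signature;
  finally
    Stream.Position := SavedPos; // leave the stream where the caller had it
  end;
end;

var
  Demo: TStringStream;
begin
  // Simulated error page, standing in for the 1 KB file from the thread.
  Demo := TStringStream.Create('<html><body>400 Bad Request</body></html>');
  try
    if LooksLikePdf(Demo) then
      WriteLn('looks like a PDF')
    else
      WriteLn('not a PDF: ', Demo.DataString);
  finally
    Demo.Free;
  end;
end.
```

Printing the body when the check fails shows the server's message directly, instead of having to open test.pdf in a text editor.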

TRon

  • Hero Member
  • *****
  • Posts: 2537
« Reply #49 on: June 08, 2020, 07:26:10 pm »
For a 'download' URL the site seems to want you to comply with the usual URL standards. For a search, it seems the site expects the spaces you typed in the search bar to be converted to a plus sign (whether you quote your search term or not).

In that respect it is not exactly rocket science  :). You can research it by trying every input you can think of and seeing what the server does to the URL; you then have to mimic that behaviour in your own code.

alaa123456789

« Reply #50 on: June 08, 2020, 07:34:06 pm »
Quote
In the 1kb file you'll see a bad request.
please see capture7
Quote
For me it was that spaces are not interpreted correctly.
please see capture8

Quote
In that respect it is not exactly rocket science  :). You can research it by trying every input you can think of and seeing what the server does to the URL; you then have to mimic that behaviour in your own code.
I am new to Lazarus and I am trying my best. I have been working on this project for a month, going step by step, but I am facing challenges because I am not familiar with all the methods.
But I have made up my mind to overcome this challenge.
Thanks to everyone who is supporting me.

rvk

« Reply #51 on: June 08, 2020, 07:37:59 pm »
You used utf8encode. That does nothing for spaces.
I mentioned encodeurl in the synacode unit.
That will convert spaces to %20 or + (not sure which).

Your baseurl still has spaces in it so it fails.

TRon

« Reply #52 on: June 08, 2020, 07:45:04 pm »
Quote
I am new to Lazarus and I am trying my best. I have been working on this project for a month, going step by step, but I am facing challenges because I am not familiar with all the methods.
My remark was not meant as a (negative) comment on your coding skills, or on any lack thereof from being unfamiliar with Lazarus/Pascal.

Take your second screenshot (the search). You searched for visual basic using your edit box. Now open your browser and do the same search manually. Even though the website caches common searches, you can force it to actually perform a search (instead of showing a cached page).

Then you can see that the website turns the search term into:
Code:
https://www.pdfdrive.com/search?q=%22visual+basic%22&pagecount=&pubyear=&searchin=
The %22 represents the quotes, so you can omit those (I added them manually to circumvent the cached page). The URL you need for searching for visual basic then becomes:
Code:
https://www.pdfdrive.com/search?q=visual+basic&pagecount=&pubyear=&searchin=
I'm positive that the website has more tricks up its sleeve, but for now I was unable to detect them.

Give that plus sign a try  :D
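
In code, that observation could look roughly like this. BuildSearchUrl is a hypothetical helper name; it only swaps spaces for plus signs and reuses the URL shape shown above, so other special characters in the search term would still need a proper encoder.

```pascal
program PlusSearchUrl;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: build the search URL the way the site itself does,
// with each space in the search term turned into a plus sign.
function BuildSearchUrl(const Term: string): string;
begin
  Result := 'https://www.pdfdrive.com/search?q='
    + StringReplace(Trim(Term), ' ', '+', [rfReplaceAll])
    + '&pagecount=&pubyear=&searchin=';
end;

begin
  // Prints the second URL from this post (q=visual+basic).
  WriteLn(BuildSearchUrl('visual basic'));
end.
```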

rvk

« Reply #53 on: June 08, 2020, 07:51:00 pm »
Quote
Give that plus sign a try  :D
Both a plus sign and %20 will work for spaces.
It's just that real spaces don't work.
And even if there are spaces in an url, every browser converts them before sending the url.

Using the encodeurl (or any proper url encoder function) will work for other characters too.

TRon

« Reply #54 on: June 08, 2020, 07:59:28 pm »
@rvk,
I did not claim that it wouldn't. Our posts just crossed (I was editing mine while you posted yours); I apologise for that, as it interfered with your support.

It was not my intention to mingle with your business, but I probably skimmed too quickly over the thread, as I was unable to locate where you mentioned urlencode/decode earlier :-[. Using that should be good enough. In fact I wouldn't want to have it any other way   :) (I like to see for myself first what the website does, as urlencode/decode is just a black box to me).


rvk

« Reply #55 on: June 08, 2020, 08:09:43 pm »
Quote
It was not my intention to mingle with your business, but I probably skimmed too quickly over the thread, as I was unable to locate where you mentioned urlencode/decode earlier :-[.
O, I don't consider it mingling. Everybody can pitch in  8)
The more the better  :D

I didn't even think about the + signs at first. Sometimes spaces are converted to +. Not sure when %20 is used and when +.

For the redirect from the site itself, the filenames also contain spaces in the Location header (which I found weird). For that one, manually replacing the spaces might be sufficient and even needed, because if the filename contains a special character that is already encoded, you don't want to encode it twice.

For the edit1.text you probably do want encodeurl, because there you can enter special characters like %, & etc.
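
The double-encoding concern can be shown with a small sketch. PercentEncode is a hypothetical stand-in for a full encoder (it is not synacode's encodeurl, whose exact rules may differ), and the Location value is made up for illustration.

```pascal
program DoubleEncodeDemo;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical stand-in for a full URL encoder: escapes everything
// except letters, digits and - _ . ~ (so it also escapes '%').
function PercentEncode(const S: string): string;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    if S[i] in ['A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'] then
      Result := Result + S[i]
    else
      Result := Result + '%' + IntToHex(Ord(S[i]), 2);
end;

const
  // Made-up file name part of a Location header: one raw space, and one
  // character (&) that the server has already encoded as %26.
  Location = 'Tom %26Jerry.pdf';
begin
  // Replacing only the raw space keeps the existing %26 intact:
  // prints Tom%20%26Jerry.pdf
  WriteLn(StringReplace(Location, ' ', '%20', [rfReplaceAll]));
  // A full encoder escapes the '%' again, turning %26 into %2526:
  // prints Tom%20%2526Jerry.pdf
  WriteLn(PercentEncode(Location));
end.
```

So the user-typed text gets the full encoder, while a value the server already encoded only gets its raw spaces patched.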


TRon

« Reply #56 on: June 08, 2020, 08:24:47 pm »
Quote
O, I don't consider it mingling. Everybody can pitch in  8)
The more the better  :D
Ok, then  8-)

Quote
I didn't even think about the + signs at first. Sometimes spaces are converted to +. Not sure when %20 is used and when +.
tbh, I am not sure either. I am positive I once read about it somewhere in an RFC and then forgot all about it again :-)

Quote
For the redirect from the site itself, the filenames also contain spaces in the Location header (which I found weird). For that one, manually replacing the spaces might be sufficient and even needed, because if the filename contains a special character that is already encoded, you don't want to encode it twice.
Indeed the website does some non-standard things; as another example, look at how it interpreted the quotes. That is pretty uncommon.

Quote
For the edit1.text you probably do want encodeurl, because there you can enter special characters like %, & etc.
I fully agree. At least this way TS now knows why he has to use it (or something similar). If you try to do that encoding manually... *beh* And you never know for sure what the user enters in there  :)

alaa123456789

« Reply #57 on: June 08, 2020, 08:46:43 pm »
Hey guys, this thread is open to everyone who wants to share his knowledge; everyone is welcome. In the end we want everyone to learn and share experience.
I tried encodeurl() from synacode and it worked for the search.
We still have an issue with downloading the book.

Actually, this website teaches a lot about web scraping, and we are trying to learn how to resolve all these challenges.

Thanks,
Alaa

rvk

« Reply #58 on: June 09, 2020, 01:00:31 pm »
Quote
We still have an issue with downloading the book.
What issues?
I already showed some code where the download worked for me.

alaa123456789

« Reply #59 on: June 10, 2020, 06:48:05 pm »
Quote
What issues?

It doesn't show any error; I only get a 1 KB file. I also tried urlmon, with the same result.

thanks
alaa

 
