
Author Topic: Saving web page to a file  (Read 4791 times)

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Saving web page to a file
« Reply #15 on: May 18, 2020, 05:32:41 pm »
Quote
A web page that needs redirection is processed correctly with this code; otherwise it is not. All pages (that I have seen) have newURI.Host = '', and only pages with redirects need to have the ASrc URL encoded. Pages without redirects do not need any "encoding" of their URL.
Uh... every webserver is forced to send a redirect response in case a page requires redirection. If it doesn't, then the webserver is not following the rules of the internets (even for a 404), so indeed the OnRedirect event will not work for you, as it will never result in an empty URL.

I'm not sure what you are trying to accomplish, but....
Quote
How can I tell when the web page needs to go through this loop. It doesn't seem to be when
(newURI.Host = '')
... that seems to suggest that you wish to process the URL (in some way that I don't quite follow) before fphttpclient does the actual redirection.

If that is the case, then set AllowRedirect to false and handle the status code returned by the webserver yourself; it should be in the 3xx range (https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_Redirection).

If I'm not mistaken, you can set the allowed response codes so that fphttpclient won't bail out with an exception, and then check the return value yourself. You would have to follow the redirect manually in case that is still required.
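To sketch that idea (a minimal example, not code from this thread; the URL is hypothetical, and a relative Location header would need to be resolved against the current URL first):

```pascal
program manualredirect;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fphttpclient;

var
  Client: TFPHTTPClient;
  Buf: TStringStream;
  URL: string;
  Hops: Integer;
begin
  Client := TFPHTTPClient.Create(nil);
  Buf := TStringStream.Create('');
  try
    Client.AllowRedirect := False;      // handle redirects ourselves
    URL := 'http://example.com/page';   // hypothetical URL
    Hops := 0;
    repeat
      Buf.Size := 0;
      // Listing the 3xx codes as allowed stops fphttpclient from
      // raising an exception when the server answers with a redirect.
      Client.HTTPMethod('GET', URL, Buf, [200, 301, 302, 303, 307, 308]);
      if (Client.ResponseStatusCode div 100) <> 3 then
        Break;                          // not a redirect: Buf holds the page
      // The target of the redirect is in the Location response header.
      URL := Client.GetHeader(Client.ResponseHeaders, 'Location');
      Inc(Hops);
    until (URL = '') or (Hops > 5);     // guard against redirect loops
    WriteLn(Buf.DataString);
  finally
    Buf.Free;
    Client.Free;
  end;
end.
```

AllowRedirect, HTTPMethod (with its AllowedResponseCodes parameter), ResponseStatusCode, ResponseHeaders, and GetHeader are all members of TFPCustomHTTPClient; the loop bound of 5 hops is an arbitrary choice.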
Today is tomorrow's yesterday.

bobonwhidbey

  • Hero Member
  • *****
  • Posts: 630
    • Double Dummy Solver - free download
Re: Saving web page to a file
« Reply #16 on: May 18, 2020, 05:54:05 pm »
At the moment, I know in advance which URLs need to go through the redirect loop and which don't. So I've added a Redirect: boolean property to TMyHTTPClient and set it in my program, depending on my hoped-for knowledge of the URL's needs. Of course I feel a bit at the mercy of the owner of the web site, who may change things in the future, but that's always been the case - especially when you're parsing HTML code that is under someone else's control.

BTW THtmlTextExtractor is a nice tool for parsing HTML.
Lazarus 3.8 FPC 3.2.2 x86_64-win64-win32/win64

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Saving web page to a file
« Reply #17 on: May 18, 2020, 06:24:24 pm »
Ok, so if I understood you correctly, you indeed wish to know when a redirect is taking place, so that you can handle the redirection yourself.

I found this thread (https://forum.lazarus.freepascal.org/index.php/topic,29262.0.html) that may be able to help you out. User evens has a loop that handles the number of redirections allowed, but the basic principle should apply to your situation as well.

If you thought adding a method to a class was cumbersome ;-)

FWIW, if I need a quick way to do that, I use:
Code: Pascal  [Select][+][-]
type
  TEventHandler = object
  public
    procedure CheckURI(Sender: TObject; const ASrc: string; var ADest: string);
  end;

var
  EventHandler: TEventHandler;
...
  Client.OnRedirect := @EventHandler.CheckURI;
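For completeness, a possible body for CheckURI (a sketch of mine, not code from the thread) that simply observes each redirect could be:

```pascal
procedure TEventHandler.CheckURI(Sender: TObject; const ASrc: string;
  var ADest: string);
begin
  // ASrc is the URL being redirected from, ADest the proposed target.
  WriteLn('Redirecting from ', ASrc, ' to ', ADest);
  // ADest can be modified here; if I read the fphttpclient sources
  // correctly, setting it to '' aborts the redirection, but verify
  // that against your FPC version before relying on it.
end;
```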
Today is tomorrow's yesterday.

trev

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2032
  • Former Delphi 1-7, 10.2 user
Re: Saving web page to a file
« Reply #18 on: May 19, 2020, 01:06:39 am »
Quote
The web site I'm trying to retrieve relies on a password being saved in the default browser's cookies. That cookie doesn't seem to be retrieved with your GetMicrochipPage code. Any idea?

No idea - it should handle cookies. There is a WinAPI flag for InternetOpenURL (INTERNET_FLAG_NO_COOKIES) to disable cookies, but I'm not using it in the code.
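As an aside: if you can extract the cookie value from the browser yourself, fphttpclient can send it explicitly via its Cookies property (a real TFPCustomHTTPClient property; the cookie name, value, and URL below are made up for illustration):

```pascal
{$mode objfpc}{$H+}
uses
  SysUtils, fphttpclient;

var
  Client: TFPHTTPClient;
  Content: string;
begin
  Client := TFPHTTPClient.Create(nil);
  try
    // Hypothetical cookie name/value; use whatever the site actually sets.
    Client.Cookies.Add('session_password=secret');
    Content := Client.Get('https://example.com/protected/page');
    WriteLn(Length(Content), ' bytes received');
  finally
    Client.Free;
  end;
end.
```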

bobonwhidbey

  • Hero Member
  • *****
  • Posts: 630
    • Double Dummy Solver - free download
Re: Saving web page to a file
« Reply #19 on: May 19, 2020, 04:18:31 pm »
Neither the GetMicroChipPage nor the TMyHttpClient approach works in my situation because a password to the site needs to be stored as a cookie in the browser. That password is stored in the default browser's cookies and for that reason the OpenDocument approach works.

The OpenDocument(URL) approach has succeeded because I tack '&download=Myfile.txt' onto the end of the URL. Not only does the site appear in the browser (which is opened automatically), but a file (Myfile.txt) is also stored in the Downloads folder.

Is it possible to grab the HTML code, just like the GetMicroChipPage and TMyHttpClient approaches do, while using the default browser? I'd prefer to avoid saving the Myfile.txt file and get directly to the task of parsing the HTML code.
Lazarus 3.8 FPC 3.2.2 x86_64-win64-win32/win64

 
