Using TGeckoBrowser to Get HTML Code

Vanilla

Newbie
Posts: 3

Using TGeckoBrowser to Get HTML Code

« on: February 07, 2012, 01:26:04 pm »

Hello,

I've been trying to find a way to get the HTML source code of a webpage with TGeckoBrowser. So far, the only thing I've found is this:

Code:

procedure TGeckoBrowser.LoadHTML(htmlcode: String);
var
  wbchrome: nsIWebBrowserChrome;
  wb: nsIWebBrowser;
  domwindow: nsIDOMWindow;
  domdoc: nsIDOMDocument;
  domhtmldoc: nsIDOMHTMLDocument;
  nsstr: IInterfacedString;
begin
  wbchrome := Self as nsIWebBrowserChrome;
  wbchrome.GetWebBrowser(wb);
  wb.GetContentDOMWindow(domwindow);
  domwindow.GetDocument(domdoc);
  domhtmldoc:= domdoc as nsIDOMHTMLDocument;

  nsstr:= NewString;
  nsstr.Assign(htmlcode);
  domhtmldoc.Write(nsstr.AString);
end;

Basically, this code replaces the HTML source of the loaded page with our own HTML code. I made a few modifications:

Quote

procedure TGeckoBrowser.LoadHTML(htmlcode: String);
var
  wbchrome: nsIWebBrowserChrome;
  wb: nsIWebBrowser;
  domwindow: nsIDOMWindow;
  domdoc: nsIDOMDocument;
  domhtmldoc: nsIDOMHTMLDocument;
begin
  wbchrome := Self as nsIWebBrowserChrome;
  wbchrome.GetWebBrowser(wb);
  wb.GetContentDOMWindow(domwindow);
  domwindow.GetDocument(domdoc);
  domhtmldoc := domdoc as nsIDOMHTMLDocument;
end;

But I have a problem. How can I export the code in domhtmldoc into a TStringList or a TXT file? Right now, it's stuck in an nsIDOMHTMLDocument and there is nothing I can do with it.

Regards,
Vanilla

ludob

Hero Member
Posts: 1173

Re: Using TGeckoBrowser to Get HTML Code

« Reply #1 on: February 07, 2012, 02:21:38 pm »

You can't. The document is already parsed, and you can only access individual elements in their current state.
Most HTML viewers reload the document from the server or the cache, but none of that is supported in TGeckoBrowser.

Vanilla

Newbie
Posts: 3

Re: Using TGeckoBrowser to Get HTML Code

« Reply #2 on: February 07, 2012, 10:33:58 pm »

Thank you for your answer.

There is absolutely no workaround, tweak or trick?

Vanilla

Leledumbo

Hero Member
Posts: 8757
Programming + Glam Metal + Tae Kwon Do = Me

Re: Using TGeckoBrowser to Get HTML Code

« Reply #3 on: February 08, 2012, 03:02:27 am »

Quote

There is absolutely no workaround, tweak or trick?

Traverse the domhtmldoc manually to form the source code back (formatting would be gone of course, and the attributes are adjusted to their current state).
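The manual traversal suggested above can be sketched as a recursive walk that writes a start tag, the attributes, the children, and an end tag for each element. The sketch below is only an illustration: it uses Free Pascal's own fcl-xml DOM unit (TDOMNode, ReadXMLFile) as a stand-in, not the Gecko nsIDOM* interfaces, because the exact method signatures in nsXPCOM.pas are not shown in this thread. NodeToMarkup, MarkupFromXml and the file name page_source.txt are hypothetical names. The result ends up in a TStringList and a text file, as asked in the first post.

```pascal
program SerializeDomSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, XMLRead;

// Rebuilds markup for Node and its subtree by walking the child nodes.
// Assumption: only element and text nodes matter; comments and other
// node types are skipped, and special characters are not re-escaped.
function NodeToMarkup(Node: TDOMNode): String;
var
  Child: TDOMNode;
  i: Integer;
begin
  Result := '';
  case Node.NodeType of
    TEXT_NODE:
      Result := String(Node.NodeValue);
    ELEMENT_NODE:
      begin
        Result := '<' + String(Node.NodeName);
        // Attributes are emitted with their current values.
        for i := 0 to Integer(Node.Attributes.Length) - 1 do
          Result := Result + ' ' + String(Node.Attributes[i].NodeName) +
            '="' + String(Node.Attributes[i].NodeValue) + '"';
        Result := Result + '>';
        Child := Node.FirstChild;
        while Child <> nil do
        begin
          Result := Result + NodeToMarkup(Child);
          Child := Child.NextSibling;
        end;
        Result := Result + '</' + String(Node.NodeName) + '>';
      end;
  end;
end;

// Hypothetical helper: parses well-formed markup and serializes it again.
function MarkupFromXml(const Xml: String): String;
var
  Doc: TXMLDocument;
  Source: TStringStream;
begin
  Source := TStringStream.Create(Xml);
  try
    ReadXMLFile(Doc, Source);
    try
      Result := NodeToMarkup(Doc.DocumentElement);
    finally
      Doc.Free;
    end;
  finally
    Source.Free;
  end;
end;

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Text := MarkupFromXml('<html><body><p class="x">Hello</p></body></html>');
    Lines.SaveToFile('page_source.txt');  // export to a TXT file
    WriteLn(Lines.Text);
  finally
    Lines.Free;
  end;
end.
```

With TGeckoBrowser the same recursion would presumably run over nsIDOMNode, starting from the document obtained in LoadHTML; the corresponding child, sibling and attribute accessors have to be looked up in nsXPCOM.pas.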

Vanilla

Newbie
Posts: 3

Re: Using TGeckoBrowser to Get HTML Code

« Reply #4 on: February 08, 2012, 04:31:05 am »

I've been thinking of that, but this is the first time I've seen things like nsIDOMHTMLDocument or nsIDOMWindow, and it seems that nothing that normally works does here. I would appreciate a pointer or two on how to traverse the domhtmldoc manually.

ludob

Hero Member
Posts: 1173

Re: Using TGeckoBrowser to Get HTML Code

« Reply #5 on: February 08, 2012, 08:56:01 am »

Quote

I would appreciate a pointer or two on how to traverse the domhtmldoc manually.

There is http://wiki.lazarus.freepascal.org/GeckoPort, the examples, and a long thread on this forum, http://forum.lazarus.freepascal.org/index.php?topic=15352.0, that went into the dark "bowels" of TGeckoBrowser.
A piece of advice that will save you quite some time figuring out how things work:
- Install Firebug in Firefox.
- Load a page you want to analyze.
- Open Firebug, go to the console and browse the DOM from the command line, starting with 'document'. Code completion works nicely. For example, to see what GetElementsByTagName('*') returns, type document.getElementsByTagName("*") in the console.
- When you find what you need, translate it to Pascal. nsIDOMElement has a lot of descendants, and some methods you've used in the Firebug console will be in the descendants of nsIDOMElement. Firebug uses late binding and discovers all methods of the interface (it talks to the descendant), while TGeckoBrowser uses early binding, so you need to tell the compiler which descendant you are talking to. This is done by an explicit "cast". Example: setting the value of an 'input' element becomes '(nsIDOMElement as nsIDOMHTMLInputElement).SetValue(s.AString);'. The file nsXPCOM.pas has all the interface declarations, and a quick search for the method will give you the name of the descendant(s). The thread mentioned earlier has many examples.
- Warning: TGeckoBrowser uses an old version of XULRunner. You'll encounter methods in Firebug that are not implemented in the old XULRunner.
- The above method works for more than browsing. You can also test modifying attributes, adding and removing elements, etc.
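To make the early-binding point concrete, here is a small self-contained sketch of the "as" cast on interfaces. IElement, IInputElement, TInputElement and SetInputValue are hypothetical stand-ins invented for this illustration; they only mimic the relationship between nsIDOMElement and nsIDOMHTMLInputElement and are not the declarations from nsXPCOM.pas.

```pascal
program EarlyBindingCastSketch;
{$mode objfpc}{$H+}
{$interfaces com}

uses
  SysUtils;

type
  // Hypothetical base interface, playing the role of nsIDOMElement.
  IElement = interface
    ['{6F0C3A5E-1B7D-4C1A-9E55-0A1B2C3D4E01}']
    function GetTagName: String;
  end;

  // Hypothetical descendant, playing the role of nsIDOMHTMLInputElement.
  IInputElement = interface(IElement)
    ['{6F0C3A5E-1B7D-4C1A-9E55-0A1B2C3D4E02}']
    procedure SetValue(const AValue: String);
    function GetValue: String;
  end;

  TInputElement = class(TInterfacedObject, IElement, IInputElement)
  private
    FValue: String;
  public
    function GetTagName: String;
    procedure SetValue(const AValue: String);
    function GetValue: String;
  end;

function TInputElement.GetTagName: String;
begin
  Result := 'input';
end;

procedure TInputElement.SetValue(const AValue: String);
begin
  FValue := AValue;
end;

function TInputElement.GetValue: String;
begin
  Result := FValue;
end;

// The compiler only knows Element as IElement, which has no SetValue.
// The "as" cast queries the object for the descendant interface.
function SetInputValue(Element: IElement; const AValue: String): String;
var
  Input: IInputElement;
begin
  // Supports avoids the EInvalidCast that a failed "as" would raise.
  if not Supports(Element, IInputElement) then
    Exit('');
  Input := Element as IInputElement;
  Input.SetValue(AValue);
  Result := Input.GetValue;
end;

var
  Element: IElement;
begin
  Element := TInputElement.Create;
  WriteLn(Element.GetTagName, ' value: ', SetInputValue(Element, 'hello'));
end.
```

Checking with Supports first is a design choice for the sketch: when walking a real page, most elements will not implement the descendant interface you are asking for.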

Quote

Traverse the domhtmldoc manually to form the source code back (formatting would be gone of course, and the attributes are adjusted to their current state).

Everything is adjusted to its current state. If a piece of JavaScript adds or removes a few items, you see only the result and have no way to find out what came from the HTML and what was done by the script. If a new page is loaded in the OnLoad of the body (script redirection), you'll get only the new page and won't find any trace of the initial page.
