Lazarus

Free Pascal => General => Topic started by: billyb123 on January 28, 2017, 05:28:20 am

Title: (solved) parsing html in freepascal
Post by: billyb123 on January 28, 2017, 05:28:20 am
OK, first I downloaded some pages using fphttpclient. I want to parse the HTML to get some data out of them, but ReadXMLFile throws an exception because the downloaded HTML is not valid XML (I think). The HTML goes something like this:
Code: [Select]
<div id="testing" multiple>
  this is just a simple example
</div>

ReadXMLFile throws an exception saying "= is missing" after the "multiple" attribute.
Is there any way to configure ReadXMLFile to accept HTML like the example above?
What do you guys use to parse HTML?

Code: [Select]
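program cine;
{$mode objfpc}{$H+}
uses
  sysutils, classes, dom, fphttpclient, xmlread;
var
  doc: txmldocument;
begin
  doc := nil;
  try
    // raises EXMLReadError: the valueless "multiple" attribute is not well-formed XML
    readxmlfile(doc, 'file.xml');
  finally
    doc.free;
  end;
end.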

Title: Re: parsing html in freepascal
Post by: derek.john.evans on January 28, 2017, 08:52:41 am
Take a look at the unit SAX_HTML

i.e. ReadHTMLFile()
Title: Re: parsing html in freepascal
Post by: Bart on January 28, 2017, 03:25:47 pm
Or FastHtmlParser (comes with fpc).
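For anyone who wants to see what that looks like, here is a minimal sketch of the event-driven style FastHTMLParser uses. It assumes the unit exposes THTMLParser with OnFoundTag/OnFoundText callbacks and an Exec method, with the callback signatures shown below; check the unit source of your FPC version if it does not compile. THandler and the program name are made up for the example.

```pascal
{$mode objfpc}{$H+}
program fastparsedemo;
uses
  sysutils, fasthtmlparser;

type
  // The callbacks are "of object", so they must be methods of some class.
  // THandler is a hypothetical helper that exists only for this demo.
  THandler = class
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(Text: string);
  end;

procedure THandler.FoundTag(NoCaseTag, ActualTag: string);
begin
  // ActualTag is the tag as written in the source, attributes included
  writeln('tag:  ', ActualTag);
end;

procedure THandler.FoundText(Text: string);
begin
  // skip whitespace-only runs between tags
  if Trim(Text) <> '' then
    writeln('text: ', Trim(Text));
end;

var
  html: string;
  parser: THTMLParser;
  handler: THandler;
begin
  // a string variable (not a literal) avoids ambiguity between
  // the string and PChar constructor overloads
  html := '<div id="testing" multiple>this is just a simple example</div>';
  handler := THandler.Create;
  parser := THTMLParser.Create(html);
  try
    parser.OnFoundTag := @handler.FoundTag;
    parser.OnFoundText := @handler.FoundText;
    parser.Exec;  // walks the input and fires the callbacks
  finally
    parser.Free;
    handler.Free;
  end;
end.
```

This style does not build a document tree; you get each tag and text run as it is found, so pulling out attributes such as id is left to your own string handling.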

Bart
Title: Re: parsing html in freepascal
Post by: Phil on January 28, 2017, 05:19:33 pm
Quote from: billyb123 on January 28, 2017, 05:28:20 am
OK, first I downloaded some pages using fphttpclient. I want to parse the HTML to get some data out of them, but ReadXMLFile throws an exception because the downloaded HTML is not valid XML (I think). The HTML goes something like this:
Code: [Select]
<div id="testing" multiple>
  this is just a simple example
</div>

That's not valid HTML. Review the div tag's attributes. I would expect any HTML parser to choke on that too.

Are you screen scraping? Or is this a file of your own making? If the latter, why not use XML or JSON rather than HTML?


Title: Re: parsing html in freepascal
Post by: Leledumbo on January 28, 2017, 11:24:43 pm
ReadXMLFile does strict parsing, as the XML standard requires. Use ReadHTMLFile from SAX_HTML like this:
Code: Pascal  [Select]
{$mode objfpc}{$H+}
uses
  classes, sax_html, dom_html, dom;
const
  testdata = '<div id="testing" multiple>'
           + 'this is just a simple example'
           + '</div>';
var
  doc: thtmldocument;
  els: tdomnodelist;
  src: tstringstream;
begin
  // keep a reference to the stream so it can be freed afterwards
  src := tstringstream.create(testdata);
  try
    readhtmlfile(doc, src);
    try
      els := doc.GetElementsByTagName('div');
      if els.Count > 0 then begin
        writeln(tdomelement(els[0]).getattribute('id'));
        writeln(tdomelement(els[0]).getattribute('multiple'));
        writeln(tdomelement(els[0]).textcontent);
      end;
    finally
      doc.free;
    end;
  finally
    src.free;
  end;
end.
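Since the original question was about pages fetched with fphttpclient, the same approach can be applied to a downloaded page. This is only a sketch: the URL is a placeholder, listing the href of every link is just an illustrative choice, and HTTPS URLs may need extra SSL setup that is not shown here.

```pascal
{$mode objfpc}{$H+}
program fetchandparse;
uses
  classes, sysutils, fphttpclient, sax_html, dom_html, dom;

const
  // hypothetical URL, replace it with the page you actually want
  PageUrl = 'http://example.com/';
var
  html: string;
  src: tstringstream;
  doc: thtmldocument;
  els: tdomnodelist;
  i: integer;
begin
  // SimpleGet downloads the whole response body into a string
  html := TFPHTTPClient.SimpleGet(PageUrl);
  src := tstringstream.create(html);
  try
    readhtmlfile(doc, src);
    try
      // print the href attribute of every <a> element in the page
      els := doc.GetElementsByTagName('a');
      for i := 0 to integer(els.Count) - 1 do
        writeln(tdomelement(els[i]).getattribute('href'));
    finally
      doc.free;
    end;
  finally
    src.free;
  end;
end.
```

Going through a string stream avoids writing the download to disk first; if the page is already saved as a file, ReadHTMLFile can be given the file name instead.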
Title: Re: parsing html in freepascal
Post by: billyb123 on January 29, 2017, 04:03:21 am
thanks guys, readhtmlfile works!