Author Topic: (solved) parsing html in freepascal  (Read 6071 times)

billyb123

  • New Member
  • *
  • Posts: 26
(solved) parsing html in freepascal
« on: January 28, 2017, 05:28:20 am »
OK, first I downloaded some pages using fphttpclient. I want to parse the HTML to get some data out of 'em, but ReadXMLFile throws an exception because the downloaded HTML is not valid XML (I think). The HTML goes something like this:
Code: [Select]
<div id="testing" multiple>
  this is just a simple example
</div>

ReadXMLFile throws an exception saying "= is missing" after the "multiple" attribute.
Is there any way to "configure" ReadXMLFile to accept HTML like the example above?
What do you guys use to parse HTML?

Code: [Select]
program cine;
uses
  sysutils, classes, dom, fphttpclient, xmlread;
var
  doc: txmldocument;
begin
  doc := nil;
  readxmlfile (doc, 'file.xml');
end.

« Last Edit: January 29, 2017, 04:03:35 am by billyb123 »

derek.john.evans

  • Guest
Re: parsing html in freepascal
« Reply #1 on: January 28, 2017, 08:52:41 am »
Take a look at the unit SAX_HTML, i.e. ReadHTMLFile().

Bart

  • Hero Member
  • *****
  • Posts: 3539
    • Bart en Mariska's Webstek
Re: parsing html in freepascal
« Reply #2 on: January 28, 2017, 03:25:47 pm »
Or FastHtmlParser (comes with fpc).

Bart
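
For readers who want to try this route, here is a minimal sketch of event-based parsing with the fasthtmlparser unit. It assumes the class is `THTMLParser` with `OnFoundTag` / `OnFoundText` method-pointer fields and an `Exec` method, and that the handler signatures are as written below. That is how the unit shipped with FPC is recalled here, so check the unit source for your FPC version before relying on it. `TDumper` and its methods are hypothetical names made up for the illustration.

```pascal
program fastparse;
{$mode objfpc}{$H+}
uses
  fasthtmlparser;

type
  // The parser's events are "of object", so the handlers live in a small class.
  // TDumper is a hypothetical helper, not part of the library.
  TDumper = class
    procedure FoundTag(NoCaseTag, ActualTag: string);
    procedure FoundText(Text: string);
  end;

procedure TDumper.FoundTag(NoCaseTag, ActualTag: string);
begin
  // ActualTag is the tag as written in the source
  writeln('tag:  ', ActualTag);
end;

procedure TDumper.FoundText(Text: string);
begin
  writeln('text: ', Text);
end;

const
  testdata = '<div id="testing" multiple>this is just a simple example</div>';
var
  parser: THTMLParser;
  dumper: TDumper;
begin
  dumper := TDumper.Create;
  parser := THTMLParser.Create(testdata);
  try
    parser.OnFoundTag := @dumper.FoundTag;
    parser.OnFoundText := @dumper.FoundText;
    parser.Exec;  // walks the input and fires the two events
  finally
    parser.Free;
    dumper.Free;
  end;
end.
```

Unlike ReadHTMLFile, this approach builds no DOM tree. It only reports tags and text in document order, so pulling attribute values out of a tag string is left to the caller.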

Phil

  • Hero Member
  • *****
  • Posts: 2750
Re: parsing html in freepascal
« Reply #3 on: January 28, 2017, 05:19:33 pm »
OK, first I downloaded some pages using fphttpclient. I want to parse the HTML to get some data out of 'em, but ReadXMLFile throws an exception because the downloaded HTML is not valid XML (I think). The HTML goes something like this:
Code: [Select]
<div id="testing" multiple>
  this is just a simple example
</div>

That's not valid HTML. Review the div tag's attributes. I would expect any HTML parser to choke on that too.

Are you screen scraping? Or is this a file of your own making? If the latter, why not use XML or JSON rather than HTML?



Leledumbo

  • Hero Member
  • *****
  • Posts: 8111
  • Programming + Glam Metal + Tae Kwon Do = Me
Re: parsing html in freepascal
« Reply #4 on: January 28, 2017, 11:24:43 pm »
ReadXMLFile does strict parsing, as the XML standard requires. Use ReadHTMLFile from SAX_HTML like this:
Code: Pascal  [Select]
{$mode objfpc}{$H+}
uses
  classes, sax_html, dom_html, dom;
const
  testdata = '<div id="testing" multiple>'
           + 'this is just a simple example'
           + '</div>';
var
  doc: thtmldocument;
  src: tstringstream;
  els: tdomnodelist;
  el: tdomelement;
begin
  doc := nil;
  src := tstringstream.create(testdata);
  try
    // the HTML reader tolerates the valueless "multiple" attribute
    readhtmlfile(doc, src);
    els := doc.GetElementsByTagName('div');
    if els.Count > 0 then begin
      el := tdomelement(els[0]);
      writeln(el.getattribute('id'));
      writeln(el.getattribute('multiple'));
      writeln(el.textcontent);
    end;
  finally
    // release the document and the source stream
    doc.free;
    src.free;
  end;
end.

billyb123

  • New Member
  • *
  • Posts: 26
Re: parsing html in freepascal
« Reply #5 on: January 29, 2017, 04:03:21 am »
thanks guys, readhtmlfile works!
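
To tie the thread together, here is a rough sketch of the whole workflow from the first post: download a page with fphttpclient, then hand the stream to ReadHTMLFile. The URL is a hypothetical placeholder, and listing the `href` of every `a` element is just an example of "getting some data out". For https pages, newer FPC versions typically also need an SSL handler unit such as opensslsockets in the uses clause, which is an assumption about your setup.

```pascal
program fetchparse;
{$mode objfpc}{$H+}
uses
  classes, fphttpclient, sax_html, dom_html, dom;
const
  // hypothetical URL, stands in for the pages being scraped
  PageURL = 'http://example.com/';
var
  page: TMemoryStream;
  doc: THTMLDocument;
  els: TDOMNodeList;
  i: Integer;
begin
  doc := nil;
  page := TMemoryStream.Create;
  try
    // download the page body into the memory stream
    TFPHTTPClient.SimpleGet(PageURL, page);
    page.Position := 0;  // rewind before parsing
    ReadHTMLFile(doc, page);
    // print the target of every link in the document
    els := doc.GetElementsByTagName('a');
    for i := 0 to Integer(els.Count) - 1 do
      writeln(TDOMElement(els[i]).GetAttribute('href'));
  finally
    doc.Free;
    page.Free;
  end;
end.
```

Parsing straight from the stream avoids writing the download to disk first, which is the step where the original ReadXMLFile attempt started.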