Author Topic: SAX HTML  (Read 1360 times)

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 885
Re: SAX HTML
« Reply #15 on: October 15, 2020, 05:07:38 pm »
"Page 5 of 86" is easy to find, but what are you looking for after that?
Jurassic computer : Sinclair ZX81 - Zilog Z80A at 3.25 MHz - RAM 1 KB - ROM 8 KB

pcurtis

  • Sr. Member
  • ****
  • Posts: 384
Re: SAX HTML
« Reply #16 on: October 15, 2020, 05:25:24 pm »
I suppose I can get the total number of pages from

<table class="tborder" cellpadding="3" cellspacing="1" border="0">
<tr>
   <td class="vbmenu_control" style="font-weight:normal">Page 5 of 86</td>
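Once the cell text is available, the total still has to be pulled out of the string. A minimal sketch, assuming the text always has the form "Page <current> of <total>"; the helper name ParsePageInfo is hypothetical and not part of any library mentioned in this thread:

```pascal
program PageCountSketch;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper (an assumption, not from the thread): parses text of
// the form 'Page <current> of <total>' and returns True on success.
function ParsePageInfo(const CellText: String; out Current, Total: Integer): Boolean;
var
  S: String;
  P: Integer;
begin
  Current := 0;
  Total := 0;
  Result := False;
  S := Trim(CellText);
  // Expect the literal prefix 'Page ' (compared case-insensitively).
  if not SameText(Copy(S, 1, 5), 'Page ') then
    Exit;
  Delete(S, 1, 5);
  // Everything before ' of ' is the current page, everything after is the total.
  P := Pos(' of ', S);
  if P = 0 then
    Exit;
  Result := TryStrToInt(Trim(Copy(S, 1, P - 1)), Current)
    and TryStrToInt(Trim(Copy(S, P + 4, MaxInt)), Total);
end;

var
  Cur, Tot: Integer;
begin
  if ParsePageInfo('Page 5 of 86', Cur, Tot) then
    WriteLn('current=', Cur, ' total=', Tot)
  else
    WriteLn('not a page indicator');
end.
```

Returning a Boolean instead of raising keeps the caller in control when the forum layout changes and the cell no longer matches.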
Windows 10 / Linux Mint 20
Laz 2.10.0
FPC 3.2.0

wp

  • Hero Member
  • *****
  • Posts: 7958
Re: SAX HTML
« Reply #17 on: October 15, 2020, 06:39:11 pm »
Code: Pascal
program Project1;

{$mode objfpc}{$H+}  // classes and long string constants need these

uses
  FastHtmlParser;

type
  TTableParser = class(THTMLParser)
  private
    // Flags tracking where the parser currently is in the document
    InTable: Boolean;
    InRow: Boolean;
    InCell: Boolean;
    FoundText: String;
  public
    procedure TagFound(NoCaseTag, ActualTag: string);
    procedure TextFound(Text: String);
  end;

const
  HTML =
    '<table class="tborder" cellpadding="3" cellspacing="1" border="0">'+
    '<tr>'+
      '<td class="vbmenu_control" style="font-weight:normal">Page 5 of 86</td>'+
      '<td class="alt1" nowrap="nowrap"><a rel="start" class="smallfont" href="t29502-product-x.html" title="First Page - Results 1 to 10 of 858"><strong>&laquo;</strong> First</a></td>'+
      '<td class="alt1"><a rel="prev" class="smallfont" href="t29502-p85-product-x.html" title="Prev Page - Results 841 to 850 of 858">&lt;</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p36-product-x.html" title="Show results 351 to 360 of 858"><!---50-->36</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p76-product-x.html" title="Show results 751 to 760 of 858"><!---10-->76</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p77-product-x.html" title="Show results 761 to 770 of 858">77</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p78-product-x.html" title="Show results 771 to 780 of 858">78</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p79-product-x.html" title="Show results 781 to 790 of 858">79</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p80-product-x.html" title="Show results 791 to 800 of 858">80</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p81-product-x.html" title="Show results 801 to 810 of 858">81</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p82-product-x.html" title="Show results 811 to 820 of 858">82</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p83-product-x.html" title="Show results 821 to 830 of 858">83</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p84-product-x.html" title="Show results 831 to 840 of 858">84</a></td>'+
      '<td class="alt1"><a class="smallfont" href="t29502-p85-product-x.html" title="Show results 841 to 850 of 858">85</a></td>'+
      '   <td class="alt2"><span class="smallfont" title="Showing results 851 to 858 of 858"><strong>86</strong></span></td>'+
    '</tr>'+
    '</table>';

var
  parser: TTableParser;


  procedure TTableParser.TagFound(NoCaseTag, ActualTag: String);
  begin
    // NoCaseTag is the upper-cased tag, so the comparisons are case-insensitive.
    if pos('<TABLE ', NoCaseTag) = 1 then
      InTable := true
    else
    if InTable and ('<TR>' = NoCaseTag) then
      InRow := true
    else
    if InRow and (pos('<TD ', NoCaseTag) = 1) then
      InCell := true;
  end;

  procedure TTableParser.TextFound(Text: String);
  begin
    // The first text seen inside a cell is taken as the result.
    if InCell then
    begin
      FoundText := Text;
      Done := true;  // stop parsing any further.
    end;
  end;

begin
  parser := TTableParser.Create(HTML);
  try
    parser.OnFoundTag := @parser.TagFound;
    parser.OnFoundText := @parser.TextFound;
    parser.Exec;
    WriteLn(parser.FoundText);
  finally
    parser.Free;
  end;

  ReadLn;
end.

... but this works only when there is no other table in the HTML and the page number is in the first cell of the first row. Otherwise the InTable, InRow and InCell (or other) flags need more sophisticated handling, for example resetting them on the closing tags.
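One way to relax both restrictions is to reset the cell flag on the closing tag and to pick the cell by its content instead of its position. A minimal sketch under stated assumptions: the class TPageCellParser and the function FindPageCell are hypothetical names, the wanted cell text is assumed to start with "Page ", and closing tags are assumed to be reported to OnFoundTag as '</TD>' in the same way the opening tags are reported above.

```pascal
program PageCellSketch;
{$mode objfpc}{$H+}

uses
  SysUtils, FastHtmlParser;

type
  // Hypothetical parser class for illustration; not part of FastHtmlParser.
  TPageCellParser = class(THTMLParser)
  private
    InCell: Boolean;
    FoundText: String;
  public
    procedure TagFound(NoCaseTag, ActualTag: String);
    procedure TextFound(Text: String);
  end;

procedure TPageCellParser.TagFound(NoCaseTag, ActualTag: String);
begin
  // Enter a cell on <TD> or <TD ...>; leave it again on </TD>
  // (assumption: closing tags arrive here in this upper-cased form).
  if (NoCaseTag = '<TD>') or (Pos('<TD ', NoCaseTag) = 1) then
    InCell := True
  else if NoCaseTag = '</TD>' then
    InCell := False;
end;

procedure TPageCellParser.TextFound(Text: String);
begin
  // Accept the cell by its content, not by its position in the document.
  if InCell and (Pos('Page ', Trim(Text)) = 1) then
  begin
    FoundText := Trim(Text);
    Done := True;  // stop parsing any further
  end;
end;

// Returns the text of the first cell starting with 'Page ', or '' if none.
function FindPageCell(const HTML: String): String;
var
  Parser: TPageCellParser;
begin
  Parser := TPageCellParser.Create(HTML);
  try
    Parser.OnFoundTag := @Parser.TagFound;
    Parser.OnFoundText := @Parser.TextFound;
    Parser.Exec;
    Result := Parser.FoundText;
  finally
    Parser.Free;
  end;
end;

const
  // Sample input with a second table and the page cell not in first position.
  SampleHTML =
    '<table><tr><td>unrelated</td></tr></table>' +
    '<table class="tborder"><tr>' +
    '<td class="alt1"><a href="#">First</a></td>' +
    '<td class="vbmenu_control">Page 5 of 86</td>' +
    '</tr></table>';

begin
  WriteLn(FindPageCell(SampleHTML));
end.
```

Matching on the text is still fragile if another cell happens to start with "Page "; checking the class attribute in ActualTag as well would narrow it further.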
Mainly Lazarus trunk / fpc 3.2.0 / all 32-bit on Win-10, but many more...

pcurtis

  • Sr. Member
  • ****
  • Posts: 384
Re: SAX HTML
« Reply #18 on: October 15, 2020, 07:19:54 pm »
Thanks. I'll take a look tomorrow.

BeniBela

  • Hero Member
  • *****
  • Posts: 770
    • homepage
Re: SAX HTML
« Reply #19 on: October 15, 2020, 10:16:54 pm »
Quote: "I think that the package isn't very stable"

It is very stable

It just has become so complex that fpc cannot really compile it anymore

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 885
Re: SAX HTML
« Reply #20 on: October 16, 2020, 08:34:28 am »
Hello,
it finally seems that the XPath unit of fcl-xml works well with HTML documents.
Here is a small project (attached) that uses XPath to search for elements in HTML pages:
1 - Search for all the title attributes of recent_topics on the home page of the Lazarus forum.
2 - Search for all the href attributes along the defined tree path in pcurtis's HTML table.
3 - Search for the td that contains the string "Page" in pcurtis's HTML table.

Code: Pascal
program parseHtmlTest;
// J.P October 2020
{$mode objfpc}{$H+}

uses classes, fphttpclient, opensslsockets, DOM, DOM_HTML, SAX_HTML, XPath;
var
  htmlDoc: THTMLDocument;
  XPathRes: TXPathVariable;
  XPathExp: DomString;
  TheNodeSet : TNodeSet;
  s: String;
  htmlStream: TStringStream;

  // print the text content of every node in the set
  procedure DisplayResult( NS: TNodeSet);
  var TheNode : Pointer;
  begin
     For TheNode in NS  do
     begin
        Writeln(TDomNode(TheNode).TextContent);
     end;
     Writeln('===================================');
  end;

{$R *.res}

begin
  // initialise so that the finally block is safe if an exception occurs early
  htmlDoc := nil;
  htmlStream := nil;
  try
   s := TFPCustomHTTPClient.SimpleGet('https://forum.lazarus.freepascal.org/index.php');
   htmlStream := TStringStream.Create(s);
   ReadHTMLFile(htmlDoc, htmlStream);
   // search for all the title attributes of recent_topics in home page of lazarus forum
   XPathExp := '//ul[@class="recent_topics"]/li/a/@title';
   XPathRes := EvaluateXPathExpression(XPathExp, htmlDoc.DocumentElement);
   TheNodeSet := XPathRes.AsNodeSet;
   DisplayResult(TheNodeSet);
   XPathRes.Free;  // release the result before XPathRes is reused
   // read input html file (free the first document before htmlDoc is reused)
   htmlDoc.Free;
   htmlDoc := nil;
   ReadHTMLFile(htmlDoc, 'table.html');
   // search for all the href attributes in the defined tree path
   XPathExp := '//table[@class="tborder"]/tr/td/a/@href';
   XPathRes := EvaluateXPathExpression(XPathExp, htmlDoc.DocumentElement);
   TheNodeSet := XPathRes.AsNodeSet;
   DisplayResult(TheNodeSet);
   XPathRes.Free;
   // Search for td which contains "Page" string
   XPathExp := '//td[contains(text(),"Page")]';
   XPathRes := EvaluateXPathExpression(XPathExp, htmlDoc.DocumentElement);
   TheNodeSet := XPathRes.AsNodeSet;
   DisplayResult(TheNodeSet);
   XPathRes.Free;

  finally
     htmlDoc.Free;
     htmlStream.Free;
  end;
  Readln;

end.

Tested with Lazarus 2.0.10 / FPC 3.2.0 on Windows 10 and CentOS 8.1.

Result in attachment.
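For the page-count question earlier in the thread, the third search can be wrapped so that it returns just the text of the first match and releases the XPath result itself. A minimal sketch, assuming the HTML is already in a string; FirstMatchText and SampleHTML are hypothetical names, and only calls that already appear in the program above are used:

```pascal
program XPathPageSketch;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XPath;

// Hypothetical helper: text content of the first node matching Expr,
// or '' when nothing matches. The XPath result is freed before returning.
function FirstMatchText(Doc: THTMLDocument; const Expr: DOMString): String;
var
  Res: TXPathVariable;
  Nodes: TNodeSet;
begin
  Result := '';
  Res := EvaluateXPathExpression(Expr, Doc.DocumentElement);
  try
    Nodes := Res.AsNodeSet;
    if Assigned(Nodes) and (Nodes.Count > 0) then
      Result := String(TDOMNode(Nodes[0]).TextContent);
  finally
    Res.Free;
  end;
end;

const
  // Reduced stand-in for the table posted earlier in the thread.
  SampleHTML =
    '<html><body><table class="tborder"><tr>' +
    '<td class="vbmenu_control">Page 5 of 86</td>' +
    '<td class="alt1"><a href="t29502-p85-product-x.html">85</a></td>' +
    '</tr></table></body></html>';

var
  Doc: THTMLDocument;
  Stream: TStringStream;
begin
  Doc := nil;
  Stream := TStringStream.Create(SampleHTML);
  try
    ReadHTMLFile(Doc, Stream);
    WriteLn(FirstMatchText(Doc, '//td[contains(text(),"Page")]'));
  finally
    Doc.Free;
    Stream.Free;
  end;
end.
```

Copying the text out before freeing the result means the caller never holds a node set whose owner has already been destroyed.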

Friendly, J.P

PascalDragon

  • Hero Member
  • *****
  • Posts: 2605
  • Compiler Developer
Re: SAX HTML
« Reply #21 on: October 17, 2020, 12:33:17 pm »
Quote: "It just has become so complex that fpc cannot really compile it anymore"

Have such problems been reported?

pcurtis

  • Sr. Member
  • ****
  • Posts: 384
Re: SAX HTML
« Reply #22 on: October 17, 2020, 01:03:55 pm »
I haven't forgotten this thread. I'm just a little sidetracked.

Please bear with me.


PascalDragon

  • Hero Member
  • *****
  • Posts: 2605
  • Compiler Developer
Re: SAX HTML
« Reply #24 on: October 17, 2020, 03:02:08 pm »

 
