
Author Topic: [SOLVED] HTML read tag  (Read 882 times)

Pe3s

  • Hero Member
  • *****
  • Posts: 533
[SOLVED] HTML read tag
« on: May 02, 2023, 09:28:00 pm »
Hello forum users, is it possible to read the content of tags from an html file and display them in ListView?
Regards :)
« Last Edit: May 03, 2023, 08:41:11 am by Pe3s »

waltfair

  • New Member
  • *
  • Posts: 21
  • Walt Fair, PHD, PE. Engineer and software junkie
Re: HTML read tag
« Reply #1 on: May 02, 2023, 10:12:18 pm »
Since HTML is a text file, you could of course parse the text and extract the tags yourself.
I assume you're asking for something more elegant, though, and I can't help you there.
Good luck,
Walt Fair PHD, PE

dsiders

  • Hero Member
  • *****
  • Posts: 1080
Re: HTML read tag
« Reply #2 on: May 02, 2023, 10:26:43 pm »
FPC has several options for parsing HTML:

* fasthtmlparser (simplistic but fast for tag or text values)
* THTMLReader (Fires events when document, element, or attribute components are found)
* THTMLDocument (A DOM document where you do all the heavy lifting)

Demo code or example usage is available for each of these in fcl or the Lazarus IDE.

Adding the results from one of these to a list view depends on the parser/analyzer selected. But that should be the trivial part of the exercise.
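
As a rough illustration of the DOM option, here is a minimal console sketch. It assumes the FCL units DOM, DOM_HTML and SAX_HTML, and uses ReadHTMLFile from SAX_HTML to build a THTMLDocument from a stream. The names CollectElements and SampleHtml are made up for this example and do not come from the demo mentioned in this thread.

```pascal
program ListHtmlTags;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper: walks the tree below ANode and appends one
// "tagname=text" line per element node to AOutput.
procedure CollectElements(ANode: TDOMNode; AOutput: TStrings);
var
  Child: TDOMNode;
begin
  if ANode = nil then
    Exit;
  if ANode.NodeType = ELEMENT_NODE then
    AOutput.Add(String(ANode.NodeName) + '=' + Trim(String(ANode.TextContent)));
  Child := ANode.FirstChild;
  while Child <> nil do
  begin
    CollectElements(Child, AOutput);
    Child := Child.NextSibling;
  end;
end;

const
  // Sample input for illustration only.
  SampleHtml = '<html><head><title>Demo</title></head>' +
               '<body><h1>Heading</h1><p>First paragraph</p></body></html>';

var
  Source: TStringStream;
  Doc: THTMLDocument;
  Tags: TStringList;
  i: Integer;
begin
  Doc := nil;
  Source := TStringStream.Create(SampleHtml);
  Tags := TStringList.Create;
  try
    // ReadHTMLFile creates the document; the caller frees it.
    ReadHTMLFile(Doc, Source);
    CollectElements(Doc.DocumentElement, Tags);
    for i := 0 to Tags.Count - 1 do
      WriteLn(Tags[i]);
  finally
    Doc.Free;
    Tags.Free;
    Source.Free;
  end;
end.
```

In an LCL form, the loop at the end would presumably add each entry with ListView.Items.Add between BeginUpdate and EndUpdate instead of calling WriteLn.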

Preview Lazarus 3.99 documentation at: https://dsiders.gitlab.io/lazdocsnext

dsiders

  • Hero Member
  • *****
  • Posts: 1080
Re: HTML read tag
« Reply #3 on: May 03, 2023, 08:01:05 am »
I was curious how each of the options performed, so I put together the attached demo. Ironically, the option with the word "Fast" in its name... wasn't really all that fast. Go figure.


Pe3s

  • Hero Member
  • *****
  • Posts: 533
Re: HTML read tag
« Reply #4 on: May 03, 2023, 08:40:42 am »
Thank you @waltfair, dsiders :)

madref

  • Hero Member
  • *****
  • Posts: 949
  • ..... A day not Laughed is a day wasted !!
    • Nursing With Humour
Re: [SOLVED] HTML read tag
« Reply #5 on: May 03, 2023, 02:04:32 pm »
You treat a disease, you win, you lose.
You treat a person and I guarantee you, you win, no matter the outcome.

Lazarus 3.99 (rev main_3_99-649-ge13451a5ab) FPC 3.3.1 x86_64-darwin-cocoa
Mac OS X Monterey

wp

  • Hero Member
  • *****
  • Posts: 11916
Re: HTML read tag
« Reply #6 on: May 03, 2023, 02:19:33 pm »
Quote from dsiders: I was curious how each of the options performed, so I put together the attached demo. Ironically, the option with the word "Fast" in its name... wasn't really all that fast. Go figure.
Thank you, dsiders, for this demo. This is the first time that I have seen the SAX reader explicitly in action.

Nevertheless, I doubt that this speed test says much about the speed differences between the three parsers, because you add the parsed results to a listview, and this introduces a lot of overhead. Moreover, the tests are not written consistently: in one test, clearing the listview is included in the measured code, as is extraction of the HTML from memo.Text (but this is negligible in comparison with the first issue).

I rewrote the demo so that the results are first written to a stringlist. Now the time to parse your (small) demo file is displayed as 0 ms for all three cases. I found an HTML file in the docs/html/lazutils folder which is larger than 1 MB (uni936c.html), and for this file the times are measured as follows:
  • fasthtmlparser: 31 ms
  • SAX parser: 187 ms
  • DOM parser: 656 ms.
(The time to display the output in the listview, however, is "endless", about 1 minute or so. This should be done in virtual mode, but I did not want to spend so much time on it.)
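
The measurement idea described above (collect the results in a string list and time only the parsing, keeping any display work outside the timed region) could be sketched roughly as follows. This is not the rewritten demo itself. It assumes the FCL DOM route via ReadHTMLFile from SAX_HTML, and BuildSampleHtml and CollectElementNames are hypothetical helpers that generate and walk a synthetic document.

```pascal
program TimeParseOnly;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper: builds a synthetic document with AParagraphs <p> elements.
function BuildSampleHtml(AParagraphs: Integer): String;
var
  i: Integer;
begin
  Result := '<html><body>';
  for i := 1 to AParagraphs do
    Result := Result + '<p>Paragraph ' + IntToStr(i) + '</p>';
  Result := Result + '</body></html>';
end;

// Hypothetical helper: appends the name of every element below ANode to AOutput.
procedure CollectElementNames(ANode: TDOMNode; AOutput: TStrings);
var
  Child: TDOMNode;
begin
  if ANode = nil then
    Exit;
  if ANode.NodeType = ELEMENT_NODE then
    AOutput.Add(String(ANode.NodeName));
  Child := ANode.FirstChild;
  while Child <> nil do
  begin
    CollectElementNames(Child, AOutput);
    Child := Child.NextSibling;
  end;
end;

var
  Source: TStringStream;
  Doc: THTMLDocument;
  Names: TStringList;
  StartTick, ParseMs: QWord;
begin
  Doc := nil;
  // Preparation happens before the clock starts.
  Source := TStringStream.Create(BuildSampleHtml(20000));
  Names := TStringList.Create;
  try
    // Timed region: parsing and collecting into the string list only.
    StartTick := GetTickCount64;
    ReadHTMLFile(Doc, Source);
    CollectElementNames(Doc.DocumentElement, Names);
    ParseMs := GetTickCount64 - StartTick;

    // Untimed: any display work (here just a summary line) comes afterwards.
    WriteLn(Format('Collected %d element names in %d ms', [Names.Count, ParseMs]));
  finally
    Doc.Free;
    Names.Free;
    Source.Free;
  end;
end.
```

GetTickCount64 has coarse resolution, so small inputs may report 0 ms, as noted above for the small demo file.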

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: [SOLVED] HTML read tag
« Reply #7 on: May 03, 2023, 07:08:39 pm »
Or use the parser from my Internet Tools.

It works like a DOM, but takes 130 ms:

Code: Pascal
uses simplehtmltreeparser;

procedure TForm1.Button1Click(Sender: TObject);

  // Outputs one attribute or non-element node; empty values are skipped.
  procedure ProcessNode(ANode: TTreeNode);
  var
    Value: String;
  begin
    Value := Trim(String(ANode.getStringValue()));
    if Value = '' then Exit;

    {$IFDEF STRINGLIST}
    FOutputList.Add(ANode.value + '=' + Value);
    {$ELSE}
    with ListViewDoc.Items.Add do
    begin
      Caption := ANode.ClassName;
      SubItems.Add(String(ANode.NodeName));
      SubItems.Add(Value);
    end;
    {$ENDIF}
  end;

  // Outputs an element, then its attributes, then recurses into its children.
  procedure ProcessElement(AElement: TTreeNode);
  var
    a: TTreeAttribute;
    n: TTreeNode;
  begin
    {$IFDEF STRINGLIST}
    FOutputList.Add(AElement.ClassName + '=' + AElement.value);
    {$ELSE}
    with ListViewDoc.Items.Add do
    begin
      Caption := AElement.ClassName;
      SubItems.Add(String(AElement.TagName));
    end;
    {$ENDIF}

    for a in AElement.getEnumeratorAttributes do
      ProcessNode(a);

    for n in AElement.getEnumeratorChildren do
      if n.typ = tetOpen then
        ProcessElement(n)
      else
        ProcessNode(n);
  end;

var
  tp: TTreeParser;
  AStart, AEnd: QWord;
  t: String;
  tnode: TTreeDocument;
begin
  t := memo1.Lines.Text;
  // Lock and clear the same control that receives the items (ListViewDoc).
  {$IFDEF STRINGLIST}
  FOutputList.BeginUpdate;
  {$ELSE}
  ListViewDoc.BeginUpdate;
  {$ENDIF}
  try
    {$IFDEF STRINGLIST}
    FOutputList.Clear;
    {$ELSE}
    ListViewDoc.Clear;
    {$ENDIF}
    AStart := GetTickCount64;

    tp := TTreeParser.Create;
    try
      tp.parsingModel := pmHTML;
      tnode := tp.parseTree(t, '', 'html');
      ProcessElement(tnode);
      //caption := query('count($_1//*)', [xqvalue(tnode)]).tostring;
      //t := tnode.outerHTML();
    finally
      tp.Free;  // release the parser even if parsing raises an exception
    end;

    AEnd := GetTickCount64;
    StatusBar1.SimpleText := Format('TTreeParser (%d msec)', [AEnd - AStart]);
  finally
    {$IFDEF STRINGLIST}
    FOutputList.EndUpdate;
    DisplayOutput(ListViewDoc);
    {$ELSE}
    ListViewDoc.EndUpdate;
    {$ENDIF}
  end;
end;


 
