Author Topic: fasthtmlparser and empty tag  (Read 2292 times)

krolikbest

  • Full Member
  • ***
  • Posts: 246
fasthtmlparser and empty tag
« on: July 01, 2020, 09:31:19 am »
Hi

I'm using fasthtmlparser to get a tag's value. For example:
<td>qwerty</td> gives me "qwerty". That is fine, but when the tag is empty (<td></td>), I see the tag in the OnTag procedure while the OnText procedure gives nothing. That is probably correct because the tag is empty, but how can I detect that a tag is empty and record that fact? If I have a table with a few cells and, for example, two or more are empty, then after receiving the values in the OnText procedure I have no idea which cells are empty and which are not. Or am I missing something?

Bart

  • Hero Member
  • *****
  • Posts: 5288
    • Bart en Mariska's Webstek
Re: fasthtmlparser and empty tag
« Reply #1 on: July 01, 2020, 10:11:45 am »
You can take a look at my SimpleHtmlTableParser unit.

Bart

TRon

  • Hero Member
  • *****
  • Posts: 2496
Re: fasthtmlparser and empty tag
« Reply #2 on: July 01, 2020, 10:25:17 am »
Apart from Bart's solution, this is a bit difficult to answer because it depends on how and what you parse.

For instance, the code below shows two routes that you could take:
Code: Pascal
program parse;

{$MODE OBJFPC}{$H+}

uses
  sysutils, fasthtmlparser;

type
  TObjectsEvents = object
    LastText : string;           // method 1: text of the current cell
    TableRow : array of string;  // method 2: one entry per <td>
    procedure DoTag(NoCaseTag, ActualTag: string);
    procedure DoText(txt: string);
  end;

procedure TObjectsEvents.DoTag(NoCaseTag, ActualTag: string);
begin
  WriteLn('DoTag:', ActualTag);
  case NoCaseTag of
    '<TD>'  :
    begin
      // a new cell starts: reset the text and reserve an (empty) slot
      LastText := '';
      SetLength(TableRow, Length(TableRow) + 1);
    end;
    '</TD>' : { use LastText, can be empty };
  end;
end;

procedure TObjectsEvents.DoText(txt: string);
begin
  WriteLn('DoText:', txt);
  // method 1
  LastText := txt;
  // method 2 (guarded: text may arrive before the first <td>)
  if Length(TableRow) > 0 then
    TableRow[High(TableRow)] := txt;
end;

procedure parsing;
var
  html : THTMLParser;
  oe   : TObjectsEvents;
  celldata : string;
begin
  html := THTMLParser.Create('<td>hello</td><td></td><td>goodbye</td>');
  html.OnFoundTag := @oe.DoTag;
  html.OnFoundText := @oe.DoText;
  html.Exec;
  html.Free;
  // empty cells show up as empty strings in the row
  for celldata in oe.TableRow
    do  WriteLn(celldata.QuotedString);
end;

begin
  parsing;
end.

Which outputs:
Code:
DoTag:<td>
DoText:hello
DoTag:</td>
DoTag:<td>
DoTag:</td>
DoTag:<td>
DoText:goodbye
DoTag:</td>
'hello'
''
'goodbye'

But this works only for this particular type of parsing. In my own table 'ripper' I used an approach somewhat similar to Bart's: I could select which tables I was interested in (by counting table starts/ends and parsing only those tables whose index is in a user-provided list) and then parsed each table in much the same way as Bart seems to do.

One method I always use with fasthtmlparser is to keep track of the (complete) path, e.g. HTML/BODY/DIV/TABLE/TH/TR/TD, by pushing and popping the path string in the OnFoundTag handler.
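A minimal sketch of the path-tracking idea, assuming well-nested HTML. The names TPathTracker and TagName are made up for illustration, and the list of void elements is only a small sample; real-world pages will need more care.

```pascal
program pathtrack;

{$MODE OBJFPC}{$H+}

uses
  sysutils, fasthtmlparser;

type
  // Hypothetical helper object, not part of fasthtmlparser
  TPathTracker = object
    Path: string;   // current path, e.g. '/TABLE/TR/TD'
    procedure DoTag(NoCaseTag, ActualTag: string);
    procedure DoText(txt: string);
  end;

// Returns the bare element name of an upper-cased tag:
// '<TD CLASS="X">' -> 'TD', '</TD>' -> 'TD'
function TagName(const NoCaseTag: string): string;
var
  i: Integer;
begin
  Result := '';
  i := 1;
  // skip the leading '<' and an optional '/'
  while (i <= Length(NoCaseTag)) and (NoCaseTag[i] in ['<', '/']) do
    Inc(i);
  // copy up to the first delimiter
  while (i <= Length(NoCaseTag)) and
        not (NoCaseTag[i] in [' ', '/', '>', #9, #10, #13]) do
  begin
    Result := Result + NoCaseTag[i];
    Inc(i);
  end;
end;

procedure TPathTracker.DoTag(NoCaseTag, ActualTag: string);
var
  elem: string;
  cut: Integer;
begin
  elem := TagName(NoCaseTag);
  // ignore comments, doctype and anything without a name
  if (elem = '') or (elem[1] = '!') then
    Exit;
  // ignore self-closing tags such as <BR/>
  if (Length(NoCaseTag) >= 2) and (NoCaseTag[Length(NoCaseTag) - 1] = '/') then
    Exit;
  // ignore a few void elements that never get a closing tag
  case elem of
    'BR', 'IMG', 'HR', 'INPUT', 'META', 'LINK': Exit;
  end;
  if Copy(NoCaseTag, 1, 2) = '</' then
  begin
    // pop: drop the last path segment (assumes well-nested HTML)
    cut := LastDelimiter('/', Path);
    if cut > 0 then
      SetLength(Path, cut - 1);
  end
  else
    // push: append the element to the path
    Path := Path + '/' + elem;
end;

procedure TPathTracker.DoText(txt: string);
begin
  // every piece of text is reported together with its path
  WriteLn(Path, ' -> ', txt);
end;

var
  html: THTMLParser;
  tracker: TPathTracker;
begin
  tracker.Path := '';
  html := THTMLParser.Create('<table><tr><td>hello</td><td></td></tr></table>');
  try
    html.OnFoundTag := @tracker.DoTag;
    html.OnFoundText := @tracker.DoText;
    html.Exec;
  finally
    html.Free;
  end;
end.
```

With the path available, a handler can decide per event whether it is inside the table cell it cares about, for example by testing whether Path ends in '/TR/TD'.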

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11444
  • FPC developer.
Re: fasthtmlparser and empty tag
« Reply #3 on: July 01, 2020, 10:45:06 am »
I think you need to count the td tags anyway. The text event might also be triggered multiple times inside a <td></td> if there is formatting, for example:

<td>bla<b>bla2</b></td>

I had such issues too, but I mostly worked from the DOM tree, and then simply walked the td's, and converted the whole subtree back to html.


krolikbest

  • Full Member
  • ***
  • Posts: 246
Re: fasthtmlparser and empty tag
« Reply #4 on: July 01, 2020, 10:48:05 am »
Probably one way or the other; I think indexing will be needed too.

wp

  • Hero Member
  • *****
  • Posts: 11906
Re: fasthtmlparser and empty tag
« Reply #5 on: July 01, 2020, 10:53:20 am »
Quote from: krolikbest
I'm using fasthtmlparser in order to get tag's value. lets say:
<td>qwerty</td> gives me "qwerty".  This is ok,  but when the tag is empty <td></td>, even though  I discover this tag in OnTag procedure but OnText procedure gives nothing. This is correct probably because tag is empty but how to discover that tag is empty and then set some info about it?
The parser runs through the html text and collects the strings between '<' and '>' and between '>' and '<'. It fires an OnFoundTag event for the former and an OnFoundText event for the latter. The tag string ('<td>' or '</td>' in your example) is passed to the OnFoundTag event as parameter ActualTag and, after upper-casing, as NoCaseTag. The text is passed to the OnFoundText event as parameter AText. So, to detect whether a tag is empty, you can check for "if AText='' then ..." in the OnFoundText event. Note, however, that for a cell such as <td></td> there is no text between the tags, so OnFoundText may not fire at all; in that case, detect the empty cell when the closing tag arrives in OnFoundTag and no text was collected since the opening tag. And be careful: empty tags can occur in other cases as well. Therefore, you must always store the context that you can obtain from the OnFoundTag event.
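The splitting described above can be illustrated with a tiny stand-alone tokenizer. This is a hypothetical sketch written for this thread, not the fasthtmlparser source; the procedure name Tokenize is made up, and it prints to the console where the real parser would fire OnFoundTag and OnFoundText.

```pascal
program minitokenizer;

{$MODE OBJFPC}{$H+}

// Hypothetical illustration only; not the fasthtmlparser source.
procedure Tokenize(const Html: string);
var
  i, start: Integer;
begin
  i := 1;
  while i <= Length(Html) do
  begin
    start := i;
    if Html[i] = '<' then
    begin
      // tag: everything from '<' up to and including the next '>'
      while (i <= Length(Html)) and (Html[i] <> '>') do
        Inc(i);
      WriteLn('tag : ', Copy(Html, start, i - start + 1));
      Inc(i);
    end
    else
    begin
      // text: everything up to (not including) the next '<'
      while (i <= Length(Html)) and (Html[i] <> '<') do
        Inc(i);
      WriteLn('text: ', Copy(Html, start, i - start));
    end;
  end;
end;

begin
  Tokenize('<td>hello</td><td></td>');
end.
```

In this sketch the second cell produces a 'tag' line for <td> and one for </td> with no 'text' line in between, because there are no characters between '>' and '<'. That is why the empty cell has to be detected from the tag events rather than from the text event.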

Maybe you should have a look at my attached demo (the demo file taken from https://www.w3schools.com/html/html_tables.asp does not contain empty cells, but you can edit the SynEdit in any way you like).

[EDIT]
@TRon: Sorry, you posted while I was typing... Your code follows the same idea as mine.
« Last Edit: July 01, 2020, 04:01:54 pm by wp »

jamie

  • Hero Member
  • *****
  • Posts: 6128
Re: fasthtmlparser and empty tag
« Reply #6 on: July 01, 2020, 12:36:48 pm »
Although not directly related, this is where I like using a PChar return value that isn't the product of a managed string.

 If the field does not exist, the result is nil. If it exists, the result is not nil and points to the field; if the field is valid but has nothing in it, the pointer points to a NULL char at the start of that field.

 Casting managed strings in this style makes the compiler generate a valid string with zero content. So it is very confusing at times to use managed strings this way, whereas with a PChar you can test for three conditions in a single return value: does not exist, exists but is empty, or exists and has content.
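A small sketch of that three-state PChar return, under the assumption that the fields live in fixed-size, zero-terminated buffers. The names Fields, FieldPtr and Describe are invented for this example.

```pascal
program pchartristate;

{$MODE OBJFPC}{$H+}

uses
  sysutils;

var
  // Three fixed-size, zero-terminated fields (hypothetical storage)
  Fields: array[0..2] of array[0..7] of Char;

// Returns nil if the field does not exist, otherwise a pointer to its
// first character (which is #0 when the field is empty).
function FieldPtr(Index: Integer): PChar;
begin
  if (Index < Low(Fields)) or (Index > High(Fields)) then
    Result := nil
  else
    Result := @Fields[Index][0];
end;

procedure Describe(Index: Integer);
var
  p: PChar;
begin
  p := FieldPtr(Index);
  if p = nil then
    WriteLn(Index, ': does not exist')
  else if p^ = #0 then
    WriteLn(Index, ': exists but is empty')
  else
    WriteLn(Index, ': ', p);
end;

begin
  // zero everything, then fill field 0 and field 2; field 1 stays empty
  FillChar(Fields, SizeOf(Fields), 0);
  StrPCopy(@Fields[0][0], 'hello');
  StrPCopy(@Fields[2][0], 'goodbye');

  Describe(0);  // has content
  Describe(1);  // exists but is empty
  Describe(5);  // does not exist
end.
```

A managed string cannot carry this distinction in a single value, since an empty string and a missing one look the same to the caller.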


krolikbest

  • Full Member
  • ***
  • Posts: 246
Re: fasthtmlparser and empty tag
« Reply #7 on: July 01, 2020, 01:10:48 pm »
So I have attached slightly rewritten code. Yesterday was my first contact with the parser, and I wasn't sure whether I was doing something wrong or my approach was wrong. The idea of counting is right. There is still the problem marcov mentioned when the tag content is complex (<td>..<p><br>....</p></td>), but for my needs simple <td></td> cells are enough (I generate the values on an Arduino and show them as simply as possible on the web server built into the Arduino), so it fits as is.

TRon

  • Hero Member
  • *****
  • Posts: 2496
Re: fasthtmlparser and empty tag
« Reply #8 on: July 01, 2020, 02:10:48 pm »
@wp: no problem at all. Usually I am the one taking 45 minutes on an edit, which then ends up in a *duh* that was already mentioned by someone quicker than me :-)

@krolikbest:
Wait until you meet nested tables  ;D

What you can do, in case you are fairly certain there are no such exotic tags inside a table's data cell, is to keep track of whether you are inside a TD tag (use OnFoundTag: set InCellDataTag to true when you find <TD> and to false when you find </TD>).

While InCellDataTag is true, you ignore the tags that you wish to ignore and keep adding text to the current/active text until you meet the ending tag (and then collect the text data).

Now, as you might have seen yourself, some tags inside a cell actually make sense, such as <BR> (replace that with a linefeed) or <P> (for example, use two linefeeds for that). If you are converting the cell text to, for example, rtf, then you can replace color, italic, bold etc. tags with rtf codes, translate them into markdown, or simply ignore them.
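The InCellDataTag approach could look roughly like this. It is a sketch under the assumption that cells are not nested and that only <BR> needs special treatment; TCellCollector and its fields are illustrative names, not part of fasthtmlparser.

```pascal
program cellcollect;

{$MODE OBJFPC}{$H+}

uses
  sysutils, fasthtmlparser;

type
  // Hypothetical helper object for this example
  TCellCollector = object
    InCellDataTag: Boolean;   // true between <TD> and </TD>
    CellText: string;         // text collected for the current cell
    Cells: array of string;   // finished cells, empty ones included
    procedure DoTag(NoCaseTag, ActualTag: string);
    procedure DoText(txt: string);
  end;

procedure TCellCollector.DoTag(NoCaseTag, ActualTag: string);
begin
  if (NoCaseTag = '<TD>') or (Copy(NoCaseTag, 1, 4) = '<TD ') then
  begin
    // a cell starts: begin with empty text
    InCellDataTag := True;
    CellText := '';
  end
  else if NoCaseTag = '</TD>' then
  begin
    // a cell ends: store whatever was collected, even if it is empty
    if InCellDataTag then
    begin
      SetLength(Cells, Length(Cells) + 1);
      Cells[High(Cells)] := CellText;
    end;
    InCellDataTag := False;
  end
  else if InCellDataTag and
          ((NoCaseTag = '<BR>') or (NoCaseTag = '<BR/>')) then
    // translate a line break into a linefeed
    CellText := CellText + LineEnding;
  // all other tags inside a cell (e.g. <B>, <P>) are simply ignored
end;

procedure TCellCollector.DoText(txt: string);
begin
  // append instead of overwrite: text may arrive in several pieces
  if InCellDataTag then
    CellText := CellText + txt;
end;

var
  html: THTMLParser;
  collector: TCellCollector;
  s: string;
begin
  collector.InCellDataTag := False;
  html := THTMLParser.Create('<td>bla<b>bla2</b></td><td></td><td>a<br>b</td>');
  try
    html.OnFoundTag := @collector.DoTag;
    html.OnFoundText := @collector.DoText;
    html.Exec;
  finally
    html.Free;
  end;
  for s in collector.Cells do
    WriteLn(s.QuotedString);
end.
```

Because the cell is stored when </TD> arrives, an empty cell still gets its own (empty) entry, and text split by formatting tags, as in marcov's <td>bla<b>bla2</b></td> example, is expected to end up joined in one entry.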

The problem, as with everything proposed here, is that it totally depends on the data you are parsing. If things get too complex, you are usually better off using a DOM parser (with xpath) or internettools (which has a somewhat more modern xpath parser/selector).

And we have not even mentioned faulty html code... in case you do not have the experience: there are more html pages around that don't follow the (w3) rules than there are pages that actually do. Missing tags, misuse of tags, scripts (in whatever language: js, pascal, json, php, python, etc.), comments. In short: a lot of things can seriously confuse/damage a simple parser.

Bart

  • Hero Member
  • *****
  • Posts: 5288
    • Bart en Mariska's Webstek
Re: fasthtmlparser and empty tag
« Reply #9 on: July 01, 2020, 03:11:59 pm »
In my FarMedTools application (http://svn.code.sf.net/p/flyingsheep/code/trunk/FarmedTools/) you can find the code I use to parse a Html table and store the result in a TStringTable (http://svn.code.sf.net/p/flyingsheep/code/trunk/MijnLib/stringtable.pp).
It only works for simple tables (no colspan or rowspan, no table inside a table).

In the sourcecode there is also sample HTML, so you can see what kind of html it is supposed to be able to parse.

Feel free to use, it's LGPL with linking exception (like LCL).

Bart

wp

  • Hero Member
  • *****
  • Posts: 11906
Re: fasthtmlparser and empty tag
« Reply #10 on: July 01, 2020, 04:13:52 pm »
Quote from: TRon
Usually I am the one taking 45 minutes on an edit, which then ends up in a *duh* that was already mentioned by someone quicker than me :-)
Nevertheless careful reviewing of a post is a very admirable habit -- having seen the large number of messages where the posters did not even notice that they did not close a [code] or [quote] tag and the entire text is in the wrong context...

krolikbest

  • Full Member
  • ***
  • Posts: 246
Re: fasthtmlparser and empty tag
« Reply #11 on: July 01, 2020, 05:47:29 pm »
Quote from: TRon
@krolikbest:
Wait until you meet nested tables  ;D

What you can do, in case you are fairly certain there are no such exotic tags inside a table's data cell, is to keep track of whether you are inside a TD tag (use OnFoundTag: set InCellDataTag to true when you find <TD> and to false when you find </TD>).

While InCellDataTag is true, you ignore the tags that you wish to ignore and keep adding text to the current/active text until you meet the ending tag (and then collect the text data).

I do exactly that. The problem was (but is no longer) with empty tags. Luckily I generate the tags on the web server side too, so I take care to keep them as simple as possible :)

 
