fasthtmlparser and empty tag

krolikbest

Full Member
Posts: 246

Hi

I'm using fasthtmlparser to get a tag's value. Let's say:
<td>qwerty</td> gives me "qwerty". That is fine, but when the tag is empty, <td></td>, I do see the tag in the OnTag procedure, yet the OnText procedure gives nothing. That is probably correct because the tag is empty, but how can I detect that a tag is empty and then record that fact? If I have a table with a few cells and, for example, two or more are empty, then after receiving the values in the OnText procedure I have no idea which cells are empty and which are not. Or am I missing something?


Bart

Hero Member
Posts: 5288

Re: fasthtmlparser and empty tag

« Reply #1 on: July 01, 2020, 10:11:45 am »

You can take a look at my SimpleHtmlTableParser unit.

Bart


TRon

Hero Member
Posts: 2496

Re: fasthtmlparser and empty tag

« Reply #2 on: July 01, 2020, 10:25:17 am »

Other than Bart's solution, it is a bit difficult to answer because it depends on how and what you parse.

For instance, the code below shows two routes that you could take:

Code: Pascal

program parse;

{$MODE OBJFPC}{$H+}

uses
  sysutils, fasthtmlparser;

type
  TObjectsEvents = object
    LastText : string;           // method 1: text of the cell being parsed
    TableRow : array of string;  // method 2: one entry per <td>
    procedure DoTag(NoCaseTag, ActualTag: string);
    procedure DoText(txt: string);
  end;

procedure TObjectsEvents.DoTag(NoCaseTag, ActualTag: string);
begin
  WriteLn('DoTag:', ActualTag);
  case NoCaseTag of
    '<TD>':
      begin
        // a new cell starts: reset the text and reserve an (empty) slot
        LastText := '';
        SetLength(TableRow, Length(TableRow) + 1);
      end;
    '</TD>': { use LastText, can be empty };
  end;
end;

procedure TObjectsEvents.DoText(txt: string);
begin
  WriteLn('DoText:', txt);
  // method 1
  LastText := txt;
  // method 2 (guarded: text may arrive before the first <td>)
  if Length(TableRow) > 0 then
    TableRow[High(TableRow)] := txt;
end;

procedure parsing;
var
  html     : THTMLParser;
  oe       : TObjectsEvents;
  celldata : string;
begin
  html := THTMLParser.Create('<td>hello</td><td></td><td>goodbye</td>');
  try
    html.OnFoundTag := @oe.DoTag;
    html.OnFoundText := @oe.DoText;
    html.Exec;
  finally
    html.Free;
  end;
  // the empty cell shows up as an empty string in the row
  for celldata in oe.TableRow do
    WriteLn(celldata.QuotedString);
end;

begin
  parsing;
end.
 

Which outputs:

Code:

DoTag:<td>
DoText:hello
DoTag:</td>
DoTag:<td>
DoTag:</td>
DoTag:<td>
DoText:goodbye
DoTag:</td>
'hello'
''
'goodbye'

But this works only for this particular type of parsing. I used an approach somewhat similar to Bart's in my own table 'ripper': I was able to select which tables I was interested in (by counting the number of table starts/ends and only parsing those that are present in a list of user-provided indexes) and then parsed the table in much the same manner as Bart seems to do.
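A rough sketch of the table-counting idea, for illustration only (this is not the code of the 'ripper' mentioned above). TTableSelector, Wanted and Active are hypothetical names. To keep the sketch independent of the parser, the handler is called by hand here, with tags in the upper-cased form the NoCaseTag parameter uses.

```pascal
program tableselect;

{$MODE OBJFPC}{$H+}

type
  // Hypothetical handler: counts <TABLE> starts and remembers whether we
  // are inside a table whose (0-based) index the user asked for.
  TTableSelector = class
  public
    Wanted      : array of Integer; // indexes of the tables to parse
    TableCount  : Integer;          // number of <TABLE> starts seen so far
    Depth       : Integer;          // current table nesting depth
    ActiveDepth : Integer;          // depth at which a wanted table opened, 0 = none
    procedure DoTag(NoCaseTag, ActualTag: string);
    function IsWanted(Index: Integer): Boolean;
    function Active: Boolean;
  end;

function TTableSelector.IsWanted(Index: Integer): Boolean;
var
  w: Integer;
begin
  Result := False;
  for w in Wanted do
    if w = Index then
      Exit(True);
end;

function TTableSelector.Active: Boolean;
begin
  Result := ActiveDepth > 0;
end;

procedure TTableSelector.DoTag(NoCaseTag, ActualTag: string);
begin
  // accept '<TABLE>' as well as '<TABLE attr=...>'
  if (NoCaseTag = '<TABLE>') or (Copy(NoCaseTag, 1, 7) = '<TABLE ') then
  begin
    Inc(Depth);
    if (ActiveDepth = 0) and IsWanted(TableCount) then
      ActiveDepth := Depth;
    Inc(TableCount);
  end
  else if NoCaseTag = '</TABLE>' then
  begin
    // leaving the wanted table (nested tables end at a deeper level)
    if Depth = ActiveDepth then
      ActiveDepth := 0;
    if Depth > 0 then
      Dec(Depth);
  end;
end;

var
  sel: TTableSelector;
begin
  sel := TTableSelector.Create;
  try
    SetLength(sel.Wanted, 1);
    sel.Wanted[0] := 1;                  // only the second table
    sel.DoTag('<TABLE>', '<table>');
    WriteLn(sel.Active);                 // FALSE
    sel.DoTag('</TABLE>', '</table>');
    sel.DoTag('<TABLE>', '<table>');
    WriteLn(sel.Active);                 // TRUE
    sel.DoTag('</TABLE>', '</table>');
    WriteLn(sel.Active);                 // FALSE
  finally
    sel.Free;
  end;
end.
```

With the real parser you would assign sel.DoTag to OnFoundTag and only collect cell text while sel.Active is true.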

One method I also always use with fasthtmlparser is to keep track of the (complete) path, e.g. HTML/BODY/DIV/TABLE/TH/TR/TD, by pushing and popping the path string in OnFoundTag.
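The push/pop idea could look roughly like the sketch below. TPathTracker is a hypothetical helper, not part of fasthtmlparser, and it naively assumes that every opening tag has a matching closing tag (so void tags, self-closing tags, comments and doctype lines would need extra handling). The handler is driven by hand here to keep the example independent of the parser.

```pascal
program pathtrack;

{$MODE OBJFPC}{$H+}

uses
  sysutils, classes;

type
  // Hypothetical helper that keeps the current tag path on a stack.
  TPathTracker = class
  private
    FStack: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    procedure DoTag(NoCaseTag, ActualTag: string);
    function Path: string;
  end;

constructor TPathTracker.Create;
begin
  inherited Create;
  FStack := TStringList.Create;
end;

destructor TPathTracker.Destroy;
begin
  FStack.Free;
  inherited Destroy;
end;

procedure TPathTracker.DoTag(NoCaseTag, ActualTag: string);
var
  TagName: string;
  i: Integer;
begin
  // strip '<', '>' and any attributes: '<TD CLASS="X">' becomes 'TD'
  TagName := Trim(Copy(NoCaseTag, 2, Length(NoCaseTag) - 2));
  i := Pos(' ', TagName);
  if i > 0 then
    TagName := Copy(TagName, 1, i - 1);
  if TagName = '' then
    Exit;
  if TagName[1] = '/' then
  begin
    // closing tag: pop only when it matches the top of the stack
    Delete(TagName, 1, 1);
    if (FStack.Count > 0) and (FStack[FStack.Count - 1] = TagName) then
      FStack.Delete(FStack.Count - 1);
  end
  else
    FStack.Add(TagName);   // opening tag: push
end;

function TPathTracker.Path: string;
var
  i: Integer;
begin
  Result := '';
  for i := 0 to FStack.Count - 1 do
  begin
    if i > 0 then
      Result := Result + '/';
    Result := Result + FStack[i];
  end;
end;

var
  t: TPathTracker;
begin
  t := TPathTracker.Create;
  try
    t.DoTag('<TABLE>', '<table>');
    t.DoTag('<TR>', '<tr>');
    t.DoTag('<TD CLASS="X">', '<td class="x">');
    WriteLn(t.Path);   // TABLE/TR/TD
    t.DoTag('</TD>', '</td>');
    WriteLn(t.Path);   // TABLE/TR
  finally
    t.Free;
  end;
end.
```

In a text handler you can then compare the current path against the one you care about before using the text.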


marcov

Administrator
Hero Member
Posts: 11444
FPC developer.

Re: fasthtmlparser and empty tag

« Reply #3 on: July 01, 2020, 10:45:06 am »

I think you need to count the td tags anyway. Text might also be triggered multiple times inside a <td></td> if there is formatting?

<td>blabla2</td>

I had such issues too, but I mostly worked from the DOM tree, and then simply walked the td's, and converted the whole subtree back to html.


krolikbest

Full Member
Posts: 246

Re: fasthtmlparser and empty tag

« Reply #4 on: July 01, 2020, 10:48:05 am »

Probably one way or the other - indexation I think too.


wp

Hero Member
Posts: 11906

Re: fasthtmlparser and empty tag

« Reply #5 on: July 01, 2020, 10:53:20 am »

Quote from: krolikbest on July 01, 2020, 09:31:19 am

I'm using fasthtmlparser to get a tag's value. Let's say:
<td>qwerty</td> gives me "qwerty". That is fine, but when the tag is empty, <td></td>, I do see the tag in the OnTag procedure, yet the OnText procedure gives nothing. That is probably correct because the tag is empty, but how can I detect that a tag is empty and then record that fact?

The parser runs through the HTML text and collects the strings between '<' and '>' and between '>' and '<'; it fires an OnFoundTag event for the former and an OnFoundText event for the latter. The tag string ('<td>' or '</td>' in your example) is passed to the OnFoundTag event as the parameters ActualTag and NoCaseTag (the latter after upper-casing). The text is passed to the OnFoundText event as the parameter AText. So, when you want to detect whether a tag is empty, just check for "if AText='' then ..." in the OnFoundText event. But be careful: empty tags can occur in other cases as well. Therefore, you must always store the context that you can obtain from the OnFoundTag event.

Maybe you should have a look at my attached demo (the demo file taken from https://www.w3schools.com/html/html_tables.asp does not contain empty cells, but you can edit the SynEdit in any way you like).

[EDIT]
@TRon: Sorry, you posted while I was typing... Your code follows the same idea as mine.

fasthtmlparser_Tables.zip (4.96 kB - downloaded 60 times.)

« Last Edit: July 01, 2020, 04:01:54 pm by wp »


jamie

Hero Member
Posts: 6128

Re: fasthtmlparser and empty tag

« Reply #6 on: July 01, 2020, 12:36:48 pm »

Although not directly related, this is where I like using a PChar return that isn't the product of a managed string.

If the field does not exist then it's nil; if it does exist but is empty, it's not nil and points to a field. If the field was valid but has nothing in it, then it will point to a NULL char at the start of that field.

Casting managed strings in this style will cause the compiler to generate a valid string with zero content. So it is very confusing at times to attempt to use managed strings this way, whereby you can test for three conditions in a single return: does not exist, exists but is empty, or exists and has content.
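A minimal sketch of that three-state PChar return, under the assumption that the fields live in fixed character buffers. Buf, GetField and Describe are hypothetical names made up for this illustration.

```pascal
program threestate;

{$MODE OBJFPC}{$H+}

uses
  sysutils;

var
  // Hypothetical storage: two fields, each a zero-terminated char buffer.
  Buf: array[0..1] of array[0..15] of Char;

// Returns nil when the field does not exist, otherwise a pointer to the
// start of the field (which holds #0 right away when the field is empty).
function GetField(Index: Integer): PChar;
begin
  if (Index < Low(Buf)) or (Index > High(Buf)) then
    Result := nil
  else
    Result := @Buf[Index][0];
end;

procedure Describe(Index: Integer);
var
  p: PChar;
begin
  p := GetField(Index);
  if p = nil then
    WriteLn(Index, ': does not exist')
  else if p^ = #0 then
    WriteLn(Index, ': exists but is empty')
  else
    WriteLn(Index, ': ', p);
end;

begin
  FillChar(Buf, SizeOf(Buf), 0);    // both fields start out empty
  StrPCopy(@Buf[0][0], 'hello');    // field 0 gets content
  Describe(0);   // 0: hello
  Describe(1);   // 1: exists but is empty
  Describe(2);   // 2: does not exist
end.
```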


The only true wisdom is knowing you know nothing

krolikbest

Full Member
Posts: 246

Re: fasthtmlparser and empty tag

« Reply #7 on: July 01, 2020, 01:10:48 pm »

So I have attached slightly rewritten code. Yesterday was my first contact with it, and I wasn't sure whether I was doing something wrong or my approach was wrong. The idea of counting is right. There is still the problem marcov mentioned when the tag is complex (<td>.. ....</td>), but for my needs a simple <td></td> (I generate values on an Arduino and show them as simply as possible on the web server built into the Arduino) fits as is.

TableExtractor.zip (5.72 kB - downloaded 65 times.)


TRon

Hero Member
Posts: 2496

Re: fasthtmlparser and empty tag

« Reply #8 on: July 01, 2020, 02:10:48 pm »

@wp: no problem at all. Usually I am the one taking 45 minutes on an edit, which then ends up in a *duh* that was already mentioned by someone quicker than me :-)

@krolikbest:
Wait until you meet nested tables

What you can do, in case you are fairly certain there are no such exotic tags inside a table's data cell, is to keep track of whether you are or aren't inside a TD tag (use OnFoundTag: set InCellDataTag to true when you find <TD> and to false when you find </TD>).

While InCellDataTag is true you ignore the tags that you wish to ignore, and keep adding text to the current/active text until you meet the ending tag (and then collect the text data).
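A rough sketch of this InCellDataTag approach. TCellCollector, CurrentText and Cells are hypothetical names, and the <b> tag in the driver is just an assumed example of formatting inside a cell. The handlers are called by hand in the order the parser would fire them, so the example does not depend on the parser itself.

```pascal
program cellcollect;

{$MODE OBJFPC}{$H+}

uses
  sysutils;

type
  // Hypothetical handler that accumulates all text between <TD> and </TD>.
  TCellCollector = class
  public
    InCellDataTag : Boolean;
    CurrentText   : string;
    Cells         : array of string;
    procedure DoTag(NoCaseTag, ActualTag: string);
    procedure DoText(Txt: string);
  end;

procedure TCellCollector.DoTag(NoCaseTag, ActualTag: string);
begin
  // accept '<TD>' as well as '<TD attr=...>'
  if (NoCaseTag = '<TD>') or (Copy(NoCaseTag, 1, 4) = '<TD ') then
  begin
    InCellDataTag := True;
    CurrentText := '';
  end
  else if NoCaseTag = '</TD>' then
  begin
    if InCellDataTag then
    begin
      // collect the cell, even when no text event fired (empty cell)
      SetLength(Cells, Length(Cells) + 1);
      Cells[High(Cells)] := CurrentText;
    end;
    InCellDataTag := False;
  end;
  // every other tag inside a cell is simply ignored
end;

procedure TCellCollector.DoText(Txt: string);
begin
  // text may arrive in several pieces when the cell contains formatting
  if InCellDataTag then
    CurrentText := CurrentText + Txt;
end;

var
  c: TCellCollector;
  s: string;
begin
  c := TCellCollector.Create;
  try
    // first cell: text split in two by a formatting tag
    c.DoTag('<TD>', '<td>');
    c.DoText('bla');
    c.DoTag('<B>', '<b>');
    c.DoText('bla2');
    c.DoTag('</B>', '</b>');
    c.DoTag('</TD>', '</td>');
    // second cell: empty, no text event at all
    c.DoTag('<TD>', '<td>');
    c.DoTag('</TD>', '</td>');
    for s in c.Cells do
      WriteLn(QuotedStr(s));   // 'blabla2' then ''
  finally
    c.Free;
  end;
end.
```

Because the cell is stored when </TD> arrives rather than in the text handler, empty cells keep their position in the row.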

Now, as you might have been able to see yourself, some tags inside a cell actually make sense, such as <br> (replace that with a LineFeed) or a <p> (for example, use two linefeeds for that). If you are converting the cell data text to RTF, for example, then you can replace color, italic, bold, etc. tags with RTF codes, or translate them into Markdown, or simply ignore those tags.

The problem, as with everything proposed here, is that it totally depends on the data you are parsing. If things get too complex you are usually better off using a DOM parser (using XPath) or internettools (which has a somewhat more modern XPath parser/selector).

And we did not even mention faulty HTML code... In case you do not have the experience: there are more HTML pages around that don't follow the (W3) rules than there are pages that actually do. Missing tags, misuse of tags, scripts (in whatever language: js, pascal, json, php, python, etc.), comments. In short: a lot of things can seriously confuse/damage a simple parser.


Bart

Hero Member
Posts: 5288

Re: fasthtmlparser and empty tag

« Reply #9 on: July 01, 2020, 03:11:59 pm »

In my FarMedTools application (http://svn.code.sf.net/p/flyingsheep/code/trunk/FarmedTools/) you can find the code I use to parse a Html table and store the result in a TStringTable (http://svn.code.sf.net/p/flyingsheep/code/trunk/MijnLib/stringtable.pp).
It only works for simple tables (no colspan or rowspan, no table inside a table).

In the sourcecode there is also sample HTML, so you can see what kind of html it is supposed to be able to parse.

Feel free to use, it's LGPL with linking exception (like LCL).

Bart


wp

Hero Member
Posts: 11906

Re: fasthtmlparser and empty tag

« Reply #10 on: July 01, 2020, 04:13:52 pm »

Quote from: TRon on July 01, 2020, 02:10:48 pm

Usually I am the one taking 45 minutes on an edit, which then ends up in a *duh* that was already mentioned by someone quicker than me :-)

Nevertheless careful reviewing of a post is a very admirable habit -- having seen the large number of messages where the posters did not even notice that they did not close a [code] or [quote] tag and the entire text is in the wrong context...


krolikbest

Full Member
Posts: 246

Re: fasthtmlparser and empty tag

« Reply #11 on: July 01, 2020, 05:47:29 pm »

@krolikbest:
Wait until you meet nested tables

