Topic: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]"

Mig.BR

  • New Member
  • *
  • Posts: 12
Is there any way native to FPC (without third-party libraries) to extract content from an HTML document using the full XPath that browsers return?

E.g.:
/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]

I tried several approaches found on the forum, such as EvaluateXPathExpression, but could not get any of them to work with this XPath format.

I found a lot of documentation for XML but nothing that works properly for HTML.

Code: Pascal
  var sRetText, sIdHTTPText: String;
      Doc: THTMLDocument;
      XPathResult: TXPathVariable;
      Stream: TStringStream;
  begin
     ...
     try
        Stream := TStringStream.Create(sIdHTTPText);
        ReadHTMLFile(Doc, Stream);
        XPathResult := EvaluateXPathExpression('/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]', Doc.DocumentElement);
        sRetText := String(XPathResult.AsText);
     finally
        XPathResult.Free;
        Doc.Free;
        Stream.Free;
     end;
     ...
  end;
« Last Edit: August 20, 2022, 08:24:51 am by Mig.BR »

dje

  • Full Member
  • ***
  • Posts: 134
sax_html is a minimal HTML parser that supports only a subset of HTML, i.e. any unknown tags are ignored and not added to the document.
You can use something like this to show the structure of a THTMLDocument in a TTreeView:

Code: Pascal
  procedure TreeViewAddDOMNode(ATreeView: TTreeView; AParent: TTreeNode; ADOMNode: TDOMNode);
  var
    LIndex: integer;
  begin
    // Add this node, labelled with its name and trimmed text content
    AParent := ATreeView.Items.AddChild(AParent, ADOMNode.NodeName + ' = ' + Trim(ADOMNode.TextContent));
    // Recurse into the children so the tree mirrors the parsed document
    for LIndex := 0 to ADOMNode.ChildNodes.Count - 1 do begin
      TreeViewAddDOMNode(ATreeView, AParent, ADOMNode.ChildNodes[LIndex]);
    end;
  end;

Call the code with:
Code: Pascal
  TreeViewAddDOMNode(Form1.TreeView1, nil, LHTMLDocument.DocumentElement);

You should see which nodes are missing and which are available.

Using an XPath against this reduced document tree does work, although it might not be the answer you want.
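For a console program without the LCL, the same inspection can be sketched as a plain text outline. This is only an illustration of the idea above: AppendOutline and HtmlOutline are hypothetical helper names, and the sample HTML is made up. It prints one node name per line, indented by depth, so the XPath can be written against the tree the parser actually built.

```pascal
program DumpHtmlTree;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper: appends one line per node, indented by depth.
procedure AppendOutline(ANode: TDOMNode; ADepth: Integer; var AText: String);
var
  LChild: TDOMNode;
begin
  AText := AText + StringOfChar(' ', ADepth * 2) + UTF8Encode(ANode.NodeName) + LineEnding;
  LChild := ANode.FirstChild;
  while LChild <> nil do
  begin
    AppendOutline(LChild, ADepth + 1, AText);
    LChild := LChild.NextSibling;
  end;
end;

// Hypothetical helper: parses sHTML with sax_html and returns the outline.
function HtmlOutline(const sHTML: String): String;
var
  LDoc: THTMLDocument;
  LStream: TStringStream;
begin
  Result := '';
  LDoc := nil;
  LStream := nil;
  try
    LStream := TStringStream.Create(sHTML);
    ReadHTMLFile(LDoc, LStream);
    if LDoc.DocumentElement <> nil then
      AppendOutline(LDoc.DocumentElement, 0, Result);
  finally
    LDoc.Free;
    LStream.Free;
  end;
end;

begin
  // Made-up sample document; replace with the HTML you want to inspect.
  Write(HtmlOutline('<html><body><div>one</div><main><div>two</div></main></body></html>'));
end.
```

Comparing this outline with the browser's full XPath shows which steps exist in the parsed tree and which do not.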
« Last Edit: July 03, 2022, 04:03:05 am by derek.john.evans »

PascalDragon

  • Hero Member
  • *****
  • Posts: 5444
  • Compiler Developer
Do you have an example of the HTML in question? Can you provide a full example of what you tried with EvaluateXPathExpression? Does it work correctly if you convert your example HTML to XML?

Mig.BR

First I want to thank everyone who tried to help.

Initially I got no result because the HTML I was passing contained only the content inside <body>...</body>, not the entire document.

My main problem was extracting text in which the <br/> line breaks were suppressed, and text with special characters on some sites.

derek.john.evans, I liked the tree idea for exploring the content of an HTML document.

PascalDragon, your request for an HTML example made me realize that another part of the application was clipping the HTML, so I was only parsing the content inside the <body>. That is why I always got empty results.

Below is the code. It might be useful for someone. :D

Code: Pascal
  function ExtractHTMLXPath(sHTML, sXPath: String): String;
  var vlDoc: THTMLDocument;
      vlXPathResult: TXPathVariable;
      vlStream: TStringStream;
  begin
    Result := '';
    try
       sHTML := StringReplace(sHTML, '<br/>', '<br/>'+LineEnding, [rfReplaceAll]);
       vlStream := TStringStream.Create(sHTML);
       ReadHTMLFile(vlDoc, vlStream);
       vlXPathResult := EvaluateXPathExpression(UTF8ToUTF16(sXPath), vlDoc.DocumentElement);
       Result := UTF16ToUTF8(vlXPathResult.AsText);
    finally
       vlXPathResult.Free;
       vlDoc.Free;
       vlStream.Free;
    end;
  end;

PascalDragon

Quote from: Mig.BR
PascalDragon, your request for an HTML example made me realize that in another part of the application I was clipping part of the HTML content and was only parsing the content inside the <body>. For this reason I always got null returns.

So to clarify: it's working correctly now for you?

You should initialize vlXPathResult, vlDoc and vlStream to Nil at the start of the function. Otherwise, if e.g. ReadHTMLFile fails with an exception, vlXPathResult will still contain garbage, and calling vlXPathResult.Free in the finally block will raise another exception.
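A minimal sketch of the function with that advice applied might look like the following. It is not the poster's final code: UTF8Decode and UTF8Encode from the RTL are swapped in for UTF8ToUTF16 and UTF16ToUTF8 (which appear to come from Lazarus' LazUTF8 unit) so that the program builds with FPC alone, the DocumentElement check is an added assumption, and the sample HTML in the main block is made up.

```pascal
program XPathNilInit;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XPath;

function ExtractHTMLXPath(sHTML, sXPath: String): String;
var
  vlDoc: THTMLDocument;
  vlXPathResult: TXPathVariable;
  vlStream: TStringStream;
begin
  Result := '';
  // Initialize everything that is freed in the finally block, so that
  // calling Free is safe even if an earlier step raises an exception.
  vlDoc := nil;
  vlXPathResult := nil;
  vlStream := nil;
  try
    sHTML := StringReplace(sHTML, '<br/>', '<br/>' + LineEnding, [rfReplaceAll]);
    vlStream := TStringStream.Create(sHTML);
    ReadHTMLFile(vlDoc, vlStream);
    // Added assumption: bail out if the parser produced no root element.
    if vlDoc.DocumentElement = nil then
      Exit;
    // RTL conversions used here instead of UTF8ToUTF16/UTF16ToUTF8.
    vlXPathResult := EvaluateXPathExpression(UTF8Decode(sXPath), vlDoc.DocumentElement);
    Result := UTF8Encode(vlXPathResult.AsText);
  finally
    vlXPathResult.Free;
    vlDoc.Free;
    vlStream.Free;
  end;
end;

begin
  // Made-up sample input, for illustration only.
  WriteLn(ExtractHTMLXPath('<html><body><div>one</div><div>two</div></body></html>',
    '/html/body/div[2]'));
end.
```

Free on a nil reference does nothing, which is why setting the three variables to nil first makes the finally block safe on every path.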

Mig.BR

Yes, PascalDragon, that way worked perfectly.

I added another StringReplace to handle line breaks written as <br /> (with a space between <br and />).
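The extra replacement could be sketched roughly as below. BreaksToLineEndings is a hypothetical helper name, and both the rfIgnoreCase flag and the bare <br> variant are assumptions that go beyond the two spellings mentioned in this thread.

```pascal
program NormalizeBr;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: puts a line ending after each <br> variant so the
// breaks survive when the text is later extracted from the parsed document.
function BreaksToLineEndings(const sHTML: String): String;
begin
  Result := StringReplace(sHTML, '<br/>', '<br/>' + LineEnding, [rfReplaceAll, rfIgnoreCase]);
  Result := StringReplace(Result, '<br />', '<br />' + LineEnding, [rfReplaceAll, rfIgnoreCase]);
  // Assumption: pages may also use the unclosed HTML form.
  Result := StringReplace(Result, '<br>', '<br>' + LineEnding, [rfReplaceAll, rfIgnoreCase]);
end;

begin
  WriteLn(BreaksToLineEndings('first<br/>second<br />third'));
end.
```

The three patterns never overlap, so the order of the replacements does not matter.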

Thanks for the tip to initialize the object variables with nil. I would only have run into that problem the moment an exception was raised.
I also liked the tip on converting HTML to XML. I will test that one too, as it may come in handy in the future.

Thanks again for your help.

Mig.BR

Quote from: PascalDragon
So to clarify: it's working correctly now for you?

It seems that my happiness was short-lived. I just came across my first XHTML document and again I got no result. I tried all kinds of modifications, all without success.

Apparently EvaluateXPathExpression can't extract anything from this type of document. I even tried the XHTML unit, but it seems incomplete.

Any idea how to use EvaluateXPathExpression to extract from XHTML documents, or is that not implemented yet?

E.g.:
Code: HTML5
  <!doctype html><html lang=pt xmlns=http://www.w3.org/1999/xhtml><meta charset=utf-8><meta name=language content=pt-br>...

It has no </html>, </body> or </head>. How could I convert this to HTML or XML?
Validator.w3.org found 26 errors on this page. Is there a way around this, or just brute force?
« Last Edit: July 07, 2022, 07:30:39 am by Miguel.BR »

PascalDragon

If it isn't a valid XHTML then XHTML functions must not treat it as valid XHTML. Do you have a full example of such a file?

Mig.BR

Here is an example of HTML.
« Last Edit: July 09, 2022, 04:35:01 am by Miguel.BR »

Mig.BR

  • New Member
  • *
  • Posts: 12
I apologize, but lack of time prevented me from returning to the topic. After a quick analysis of the XHTML and some tests, I realized that just adding the </html> tag to the end of the text solves the problem. In this case the XPath must be adapted as well: if the document lacks the <body> tag (or others), those steps must also be removed from the path.
I did not analyze the code of ReadHTMLFile() in SAX_HTML in depth because I found it a little complex, but I think it would be useful for it to verify whether the single <html> tag is actually closed.
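The workaround described above can be sketched as two small string helpers. Both names (EnsureHtmlClosed, DropXPathStep) are hypothetical, and the sketch only covers the plain case from this thread: a missing </html> and a step without a predicate, such as body, that has to be dropped from the path.

```pascal
program FixUnclosedHtml;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Hypothetical helper: appends </html> when no closing tag is present.
function EnsureHtmlClosed(const sHTML: String): String;
begin
  if Pos('</html>', LowerCase(sHTML)) = 0 then
    Result := sHTML + '</html>'
  else
    Result := sHTML;
end;

// Hypothetical helper: removes the first occurrence of a plain step
// (e.g. 'body') from an XPath such as '/html/body/div[1]'.
// Steps written with a predicate, like 'body[1]', are not matched.
function DropXPathStep(const sXPath, sStep: String): String;
begin
  if sXPath = '' then
    Exit('');
  // Add a trailing '/' so that a final step also matches '/step/'.
  Result := StringReplace(sXPath + '/', '/' + sStep + '/', '/', []);
  // Remove the trailing '/' again, unless the path is just '/'.
  if Length(Result) > 1 then
    SetLength(Result, Length(Result) - 1);
end;

begin
  WriteLn(EnsureHtmlClosed('<html lang=pt><meta charset=utf-8><div>text</div>'));
  WriteLn(DropXPathStep('/html/body/div[1]/main/div[2]', 'body'));
end.
```

Which steps need to be dropped depends on the tree the parser actually builds, so it is worth inspecting the parsed document before adapting the path.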
« Last Edit: August 20, 2022, 08:34:55 am by Mig.BR »
