Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

Mig.BR:
Is there any way native to FPC (without third party libraries) to extract the contents of an HTML based on the Full XPath returned by browsers?

E.g.:
/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]

I tested many ways found on the forum like EvaluateXPathExpression but could not get it to work for this XPath format.

I found a lot of documentation for XML but nothing that works properly for HTML.


--- Code: Pascal ---
var sRetText, sIdHTTPText: String;
    Doc: THTMLDocument;
    XPathResult: TXPathVariable;
    Stream: TStringStream;
begin
   ...
   try
      Stream := TStringStream.Create(sIdHTTPText);
      ReadHTMLFile(Doc, Stream);
      XPathResult := EvaluateXPathExpression('/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]', Doc.DocumentElement);
      sRetText := String(XPathResult.AsText);
   finally
      XPathResult.Free;
      Doc.Free;
      Stream.Free;
   end;
   ...
end;
--- End code ---

derek.john.evans:
sax_html is a minimal HTML parser that handles only a subset of HTML, i.e. any unknown tags are ignored and not added to the document.
You can use something like this to show a THTMLDocument structure in a TTreeView:


--- Code: Pascal ---
procedure TreeViewAddDOMNode(ATreeView: TTreeView; AParent: TTreeNode; ADOMNode: TDOMNode);
var
  LIndex: integer;
begin
  AParent := ATreeView.Items.AddChild(AParent, ADOMNode.NodeName + ' = ' + Trim(ADOMNode.TextContent));
  for LIndex := 0 to ADOMNode.ChildNodes.Count - 1 do begin
    TreeViewAddDOMNode(ATreeView, AParent, ADOMNode.ChildNodes[LIndex]);
  end;
end;
--- End code ---
Call the code with:

--- Code: Pascal ---
TreeViewAddDOMNode(Form1.TreeView1, nil, LHTMLDocument.DocumentElement);
--- End code ---
You should then see which nodes are missing and which are available.

Using an XPath expression against this subset document tree does work, although it might not be the answer you want.
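
For a console program without a TTreeView, the same recursive walk can print the parsed tree with indentation, which makes it easy to compare each step of a browser XPath against what the parser actually kept. This is only a minimal sketch, assuming the DOM, DOM_HTML and SAX_HTML units from FPC's fcl-xml package; DumpNode and the sample HTML string are hypothetical names and data made up for illustration.

```pascal
program DumpHtmlTree;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper: prints one node name per line, indented by depth.
procedure DumpNode(ANode: TDOMNode; ADepth: Integer);
var
  LChild: TDOMNode;
begin
  WriteLn(StringOfChar(' ', ADepth * 2), ANode.NodeName);
  LChild := ANode.FirstChild;
  while LChild <> nil do
  begin
    DumpNode(LChild, ADepth + 1);
    LChild := LChild.NextSibling;
  end;
end;

var
  LDoc: THTMLDocument;
  LStream: TStringStream;
begin
  LDoc := nil;
  // Sample input invented for this sketch; replace with the real page source.
  LStream := TStringStream.Create(
    '<html><body><div><main><p>text</p></main></div></body></html>');
  try
    ReadHTMLFile(LDoc, LStream);
    if LDoc.DocumentElement <> nil then
      DumpNode(LDoc.DocumentElement, 0);
  finally
    LDoc.Free;
    LStream.Free;
  end;
end.
```

If an element that appears in the browser's XPath does not show up in this dump, the XPath step that names it cannot match in the parsed tree.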

PascalDragon:

--- Quote from: Miguel.BR on July 03, 2022, 01:38:07 am ---Is there any way native to FPC (without third party libraries) to extract the contents of an HTML based on the Full XPath returned by browsers?

E.g.:
/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]

I tested many ways found on the forum like EvaluateXPathExpression but could not get it to work for this XPath format.

I found a lot of documentation for XML but nothing that works properly for HTML.
--- End quote ---

Do you have an example of the HTML in question? Can you provide a full example of what you tried with EvaluateXPathExpression? Does it work correctly if you convert your example HTML to XML?

Mig.BR:
First I want to thank everyone who tried to help.

Initially I got no result because the HTML I passed in contained only the content inside <body>...</body> and not the entire HTML document.

My main problem was extracting text with line breaks (<br/>) that were suppressed, and with special characters on some sites.

derek.john.evans, I liked the tree idea for exploring the content of an HTML.

PascalDragon, your request for an HTML example made me realize that in another part of the application I was clipping part of the HTML content and was only parsing the content inside the <body>. For this reason I always got null returns.

Below is the code: It might be useful for someone. :D


--- Code: Pascal ---
function ExtractHTMLXPath(sHTML, sXPath: String): String;
var vlDoc: THTMLDocument;
    vlXPathResult: TXPathVariable;
    vlStream: TStringStream;
begin
  Result := '';
  try
     sHTML := StringReplace(sHTML, '<br/>', '<br/>'+LineEnding, [rfReplaceAll]);
     vlStream := TStringStream.Create(sHTML);
     ReadHTMLFile(vlDoc, vlStream);
     vlXPathResult := EvaluateXPathExpression(UTF8ToUTF16(sXPath), vlDoc.DocumentElement);
     Result := UTF16ToUTF8(vlXPathResult.AsText);
  finally
     vlXPathResult.Free;
     vlDoc.Free;
     vlStream.Free;
  end;
end;
--- End code ---

PascalDragon:

--- Quote from: Miguel.BR on July 04, 2022, 01:31:31 am ---PascalDragon, your request for an HTML example made me realize that in another part of the application I was clipping part of the HTML content and was only parsing the content inside the <body>. For this reason I always got null returns.
--- End quote ---

So to clarify: it's working correctly now for you?


--- Quote from: Miguel.BR on July 04, 2022, 01:31:31 am ---Below is the code: It might be useful for someone. :D


--- Code: Pascal ---
function ExtractHTMLXPath(sHTML, sXPath: String): String;
var vlDoc: THTMLDocument;
    vlXPathResult: TXPathVariable;
    vlStream: TStringStream;
begin
  Result := '';
  try
     sHTML := StringReplace(sHTML, '<br/>', '<br/>'+LineEnding, [rfReplaceAll]);
     vlStream := TStringStream.Create(sHTML);
     ReadHTMLFile(vlDoc, vlStream);
     vlXPathResult := EvaluateXPathExpression(UTF8ToUTF16(sXPath), vlDoc.DocumentElement);
     Result := UTF16ToUTF8(vlXPathResult.AsText);
  finally
     vlXPathResult.Free;
     vlDoc.Free;
     vlStream.Free;
  end;
end;
--- End code ---
--- End quote ---

You should initialize vlXPathResult, vlDoc and vlStream to Nil at the start of the function. Otherwise, if e.g. ReadHTMLFile fails with an exception, vlXPathResult will (still) contain garbage, which leads to another exception when vlXPathResult.Free is called.
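
A minimal sketch of that suggestion, keeping the structure of the function quoted above and only setting the three locals to Nil before the try block. Two things here are assumptions rather than part of the original post: UTF8Decode and UTF8Encode from the System unit are swapped in for UTF8ToUTF16 and UTF16ToUTF8 so that the sketch compiles without the Lazarus LazUTF8 unit, and the sample HTML and XPath in the main block are invented for illustration.

```pascal
program XPathDemo;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XPath;

function ExtractHTMLXPath(sHTML, sXPath: String): String;
var
  vlDoc: THTMLDocument;
  vlXPathResult: TXPathVariable;
  vlStream: TStringStream;
begin
  Result := '';
  // Nil first, so the Free calls in the finally block are safe
  // even when an exception is raised before an object was created.
  vlDoc := nil;
  vlXPathResult := nil;
  vlStream := nil;
  try
    sHTML := StringReplace(sHTML, '<br/>', '<br/>' + LineEnding, [rfReplaceAll]);
    vlStream := TStringStream.Create(sHTML);
    ReadHTMLFile(vlDoc, vlStream);
    // Assumption: UTF8Decode/UTF8Encode replace the LazUTF8 helpers here.
    vlXPathResult := EvaluateXPathExpression(UTF8Decode(sXPath), vlDoc.DocumentElement);
    Result := UTF8Encode(vlXPathResult.AsText);
  finally
    vlXPathResult.Free;
    vlDoc.Free;
    vlStream.Free;
  end;
end;

begin
  // Sample input invented for this sketch.
  WriteLn(ExtractHTMLXPath(
    '<html><body><div><p>Hello</p></div></body></html>',
    '/html/body/div[1]/p'));
end.
```

Calling Free on a Nil reference is a no-op, which is why the finally block needs no extra checks once the variables are initialized.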
