Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]"

Mig.BR

New Member
Posts: 12

« on: July 03, 2022, 01:38:07 am »

Is there any way native to FPC (without third party libraries) to extract the contents of an HTML based on the Full XPath returned by browsers?

E.g.:
/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]

I tested many ways found on the forum like EvaluateXPathExpression but could not get it to work for this XPath format.

I found a lot of documentation for XML but nothing that works properly for HTML.

Code: Pascal

var sRetText, sIdHTTPText: String;
    Doc: THTMLDocument;
    XPathResult: TXPathVariable;
    Stream: TStringStream;
begin
   ...
   try
      Stream := TStringStream.Create(sIdHTTPText); 
      ReadHTMLFile(Doc, Stream);
      XPathResult := EvaluateXPathExpression('/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]', Doc.DocumentElement);
      sRetText := String(XPathResult.AsText);
   finally
      XPathResult.Free;
      Doc.Free;
      Stream.Free;
   end;
   ...
end;
 

« Last Edit: August 20, 2022, 08:24:51 am by Mig.BR »

dje

Full Member
Posts: 134

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #1 on: July 03, 2022, 02:53:23 am »

sax_html is a minimal HTML parser that handles only a subset of HTML, i.e. any unknown tags are ignored and not added to the document.
You can use something like this to show a THTMLDocument structure in a TTreeView:

Code: Pascal

procedure TreeViewAddDOMNode(ATreeView: TTreeView; AParent: TTreeNode; ADOMNode: TDOMNode);
var
  LIndex: integer;
begin
  AParent := ATreeView.Items.AddChild(AParent, ADOMNode.NodeName + ' = ' + Trim(ADOMNode.TextContent));
  for LIndex := 0 to ADOMNode.ChildNodes.Count - 1 do begin
    TreeViewAddDOMNode(ATreeView, AParent, ADOMNode.ChildNodes[LIndex]);
  end;
end; 

Call the code with:

Code: Pascal

TreeViewAddDOMNode(Form1.TreeView1, nil, LHTMLDocument.DocumentElement);  

You should see which nodes are missing and which are available.

Using an XPath against this subset document tree does work, although it might not be the answer you want.
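
For a quick check without a form, the same walk can be done on the console. This is only a sketch: it assumes the FPC units DOM, DOM_HTML and SAX_HTML, the name DumpNode is made up here, and the sample markup is hypothetical. It prints element names indented by depth, so the tree the parser actually built can be compared step by step with the XPath copied from the browser.

```pascal
program DumpHtmlTree;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

// Recursively prints element names, indented by depth.
// DumpNode is a hypothetical helper, not part of any FPC unit.
procedure DumpNode(ANode: TDOMNode; ADepth: Integer);
var
  LChild: TDOMNode;
begin
  if ANode.NodeType = ELEMENT_NODE then
    WriteLn(StringOfChar(' ', ADepth * 2), ANode.NodeName);
  LChild := ANode.FirstChild;
  while LChild <> nil do
  begin
    DumpNode(LChild, ADepth + 1);
    LChild := LChild.NextSibling;
  end;
end;

var
  LDoc: THTMLDocument;
  LStream: TStringStream;
begin
  LDoc := nil;
  // Hypothetical sample markup; replace with the page being analysed.
  LStream := TStringStream.Create(
    '<html><body><div><p>one</p></div><div><p>two</p></div></body></html>');
  try
    ReadHTMLFile(LDoc, LStream);
    DumpNode(LDoc.DocumentElement, 0);
  finally
    LDoc.Free;
    LStream.Free;
  end;
end.
```

If an element from the browser's path does not show up in this listing, the XPath has to be shortened accordingly before it can match.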

« Last Edit: July 03, 2022, 04:03:05 am by derek.john.evans »

PascalDragon

Hero Member
Posts: 5469
Compiler Developer

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #2 on: July 03, 2022, 02:21:34 pm »

Quote from: Miguel.BR on July 03, 2022, 01:38:07 am

Is there any way native to FPC (without third party libraries) to extract the contents of an HTML based on the Full XPath returned by browsers?

E.g.:
/html/body/div[1]/main/div[2]/div[2]/div[2]/div/div[1]

I tested many ways found on the forum like EvaluateXPathExpression but could not get it to work for this XPath format.

I found a lot of documentation for XML but nothing that works properly for HTML.

Do you have an example of the HTML in question? Can you provide a full example of what you tried with EvaluateXPathExpression? Does it work correctly if you convert your example HTML to XML?

Mig.BR

New Member
Posts: 12

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #3 on: July 04, 2022, 01:31:31 am »

First I want to thank everyone who tried to help.

Initially I got no result because the HTML I was passing contained only the content inside <body>...</body>, not the entire document.

My main problem was extracting text whose line breaks were suppressed, and text with special characters on some sites.

derek.john.evans, I liked the tree idea for exploring the content of an HTML.

PascalDragon, your request for an HTML example made me realize that in another part of the application I was clipping part of the HTML content and was only parsing the content inside the <body>. For this reason I always got null returns.

Below is the code: It might be useful for someone.

Code: Pascal

function ExtractHTMLXPath(sHTML, sXPath: String): String;
var vlDoc: THTMLDocument;
    vlXPathResult: TXPathVariable;
    vlStream: TStringStream;
begin
  Result := '';
  try
     sHTML := StringReplace(sHTML, '<br/>', '<br/>'+LineEnding, [rfReplaceAll]);
     vlStream := TStringStream.Create(sHTML);
     ReadHTMLFile(vlDoc, vlStream);
     vlXPathResult := EvaluateXPathExpression(UTF8ToUTF16(sXPath), vlDoc.DocumentElement);
     Result := UTF16ToUTF8(vlXPathResult.AsText);
  finally
     vlXPathResult.Free;
     vlDoc.Free;
     vlStream.Free;
  end;
end;
 

PascalDragon

Hero Member
Posts: 5469
Compiler Developer

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #4 on: July 04, 2022, 01:31:35 pm »

Quote from: Miguel.BR on July 04, 2022, 01:31:31 am

PascalDragon, your request for an HTML example made me realize that in another part of the application I was clipping part of the HTML content and was only parsing the content inside the <body>. For this reason I always got null returns.

So to clarify: it's working correctly now for you?

Quote from: Miguel.BR on July 04, 2022, 01:31:31 am

Below is the code: It might be useful for someone.

Code: Pascal

function ExtractHTMLXPath(sHTML, sXPath: String): String;
var vlDoc: THTMLDocument;
    vlXPathResult: TXPathVariable;
    vlStream: TStringStream;
begin
  Result := '';
  try
     sHTML := StringReplace(sHTML, '<br/>', '<br/>'+LineEnding, [rfReplaceAll]);
     vlStream := TStringStream.Create(sHTML);
     ReadHTMLFile(vlDoc, vlStream);
     vlXPathResult := EvaluateXPathExpression(UTF8ToUTF16(sXPath), vlDoc.DocumentElement);
     Result := UTF16ToUTF8(vlXPathResult.AsText);
  finally
     vlXPathResult.Free;
     vlDoc.Free;
     vlStream.Free;
  end;
end;

You should initialize vlXPathResult, vlDoc and vlStream to Nil at the start of the function. Otherwise, if e.g. ReadHTMLFile fails with an exception, vlXPathResult will still contain garbage, and calling vlXPathResult.Free in the finally block will raise another exception.
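
A sketch of the function with that change applied. It assumes the FPC units DOM, DOM_HTML, SAX_HTML and XPath. The UTF8ToUTF16/UTF16ToUTF8 calls (which I assume come from Lazarus' LazUTF8 unit) are swapped for the RTL's UTF8Decode/UTF8Encode so the example compiles without Lazarus units. The sample markup and XPath in the main block are made up for illustration.

```pascal
program XPathExtractDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XPath;

function ExtractHTMLXPath(sHTML: String; const sXPath: String): String;
var
  vlDoc: THTMLDocument;
  vlXPathResult: TXPathVariable;
  vlStream: TStringStream;
begin
  Result := '';
  // Initialise everything that is freed in the finally block, so Free
  // is always called on either nil or a valid object.
  vlDoc := nil;
  vlXPathResult := nil;
  vlStream := nil;
  try
    sHTML := StringReplace(sHTML, '<br/>', '<br/>' + LineEnding, [rfReplaceAll]);
    vlStream := TStringStream.Create(sHTML);
    ReadHTMLFile(vlDoc, vlStream);
    // UTF8Decode/UTF8Encode stand in for UTF8ToUTF16/UTF16ToUTF8 here.
    vlXPathResult := EvaluateXPathExpression(UTF8Decode(sXPath), vlDoc.DocumentElement);
    Result := UTF8Encode(vlXPathResult.AsText);
  finally
    vlXPathResult.Free;
    vlDoc.Free;
    vlStream.Free;
  end;
end;

begin
  // Hypothetical input: a complete document, not just the <body> content.
  WriteLn(ExtractHTMLXPath(
    '<html><body><div>first</div><div>second</div></body></html>',
    '/html/body/div[2]'));
end.
```

Calling Free on a nil reference is harmless, which is why the nil initialisation is enough to make the finally block safe.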

Mig.BR

New Member
Posts: 12

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #5 on: July 04, 2022, 07:57:01 pm »

Yes, PascalDragon, that way worked perfectly.

I added another StringReplace to handle the line breaks written with a space before the slash (<br />).

Thanks for the tip about initializing the variables to nil; I would only have run into that problem the first time an exception was raised.
I also liked the tip on converting HTML to XML. I will test that one too as it may come in handy in the future.

Thanks again for your help.
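
One possible way to try the HTML-to-XML conversion, as a rough sketch only: parse the page with ReadHTMLFile and write the resulting tree back out with WriteXML from the XMLWrite unit. Whether this is what was meant by "convert to XML" is an assumption on my part, and the sample markup is hypothetical.

```pascal
program HtmlToXmlDemo;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XMLWrite;

var
  LDoc: THTMLDocument;
  LIn, LOut: TStringStream;
begin
  LDoc := nil;
  LOut := nil;
  // Hypothetical sample with an unclosed <br>, as found in loose HTML.
  LIn := TStringStream.Create(
    '<html><body><div>line one<br>line two</div></body></html>');
  try
    ReadHTMLFile(LDoc, LIn);
    LOut := TStringStream.Create('');
    // Serialise the tree that SAX_HTML built as XML text.
    WriteXML(LDoc.DocumentElement, LOut);
    WriteLn(LOut.DataString);
  finally
    LOut.Free;
    LDoc.Free;
    LIn.Free;
  end;
end.
```

The printed text can then be inspected, or fed to an XML reader, to see which elements survived parsing.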

Mig.BR

New Member
Posts: 12

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #6 on: July 07, 2022, 06:58:05 am »

Quote from: PascalDragon on July 04, 2022, 01:31:35 pm

So to clarify: it's working correctly now for you?

It seems that my happiness was short-lived. I just came across my first XHTML and again I didn't get any return. I tried the most diverse modifications, but all without result.

Apparently EvaluateXPathExpression can't extract anything of this type. I even tried the XHTML unit but it seems incomplete.

Any idea how to use EvaluateXPathExpression to extract from XHTMLs or is it not implemented yet?

e.g.:

Code: HTML5

<!doctype html><html lang=pt xmlns=http://www.w3.org/1999/xhtml><meta charset=utf-8><meta name=language content=pt-br>...

It has no </html>, </body> or </head>. How could I convert this to HTML or XML?
Validator.w3.org found 26 errors on this page. Is there a way around this or just brute force?

« Last Edit: July 07, 2022, 07:30:39 am by Miguel.BR »

PascalDragon

Hero Member
Posts: 5469
Compiler Developer

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #7 on: July 07, 2022, 01:50:17 pm »

Quote from: Miguel.BR on July 07, 2022, 06:58:05 am

Quote from: PascalDragon on July 04, 2022, 01:31:35 pm
So to clarify: it's working correctly now for you?

It seems that my happiness was short-lived. I just came across my first XHTML and again I didn't get any return. I tried the most diverse modifications, but all without result.

Apparently EvaluateXPathExpression can't extract anything of this type. I even tried the XHTML unit but it seems incomplete.

Any idea how to use EvaluateXPathExpression to extract from XHTMLs or is it not implemented yet?

e.g.:
Code: HTML5
<!doctype html><html lang=pt xmlns=http://www.w3.org/1999/xhtml><meta charset=utf-8><meta name=language content=pt-br>...

It has no </html>, </body> or </head>. How could I convert this to HTML or XML?
Validator.w3.org found 26 errors on this page. Is there a way around this or just brute force?

If it isn't a valid XHTML then XHTML functions must not treat it as valid XHTML. Do you have a full example of such a file?

Mig.BR

New Member
Posts: 12

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #8 on: July 07, 2022, 11:07:29 pm »

Quote from: PascalDragon on July 07, 2022, 01:50:17 pm

If it isn't a valid XHTML then XHTML functions must not treat it as valid XHTML. Do you have a full example of such a file?

Here is an example of HTML.

page.zip (7.83 kB - downloaded 31 times.)

« Last Edit: July 09, 2022, 04:35:01 am by Miguel.BR »

Mig.BR

New Member
Posts: 12

Re: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]/div[2]"

« Reply #9 on: August 20, 2022, 08:17:49 am »

I apologize, but lack of time prevented me from returning to the topic. After a quick analysis of the XHTML and some tests, I realized that just adding the </html> tag to the end of the text solves the problem. In this case the XPath also has to be adapted: if the document has no <body> tag (or other tags), those steps must be removed from the path as well.
I did not analyze the code of ReadHTMLFile() in SAX_HTML in depth because I found it a little complex, but I think it would be useful for it to check whether the lone <html> tag is actually closed.
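
A minimal sketch of that workaround, using plain string handling only. The helper names EnsureHtmlClosed and RemoveXPathStep are made up for this example, and the step removal is deliberately naive: it drops only the first occurrence of a step that has no index such as [1].

```pascal
program CloseHtmlTag;
{$mode objfpc}{$H+}

uses
  SysUtils;

// Appends </html> when the text contains no closing html tag at all.
function EnsureHtmlClosed(const AHtml: String): String;
begin
  Result := AHtml;
  if Pos('</html>', LowerCase(AHtml)) = 0 then
    Result := Result + '</html>';
end;

// Drops one step (e.g. 'body') from a path such as '/html/body/div[1]'.
function RemoveXPathStep(const AXPath, AStep: String): String;
begin
  Result := StringReplace(AXPath, '/' + AStep + '/', '/', []);
end;

begin
  // prints: <!doctype html><html lang=pt><meta charset=utf-8></html>
  WriteLn(EnsureHtmlClosed('<!doctype html><html lang=pt><meta charset=utf-8>'));
  // prints: /html/div[1]/main
  WriteLn(RemoveXPathStep('/html/body/div[1]/main', 'body'));
end.
```

The patched text and the shortened path can then be passed to ExtractHTMLXPath as before.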

« Last Edit: August 20, 2022, 08:34:55 am by Mig.BR »

Author Topic: Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]" (Read 2488 times)
