Extract XPath from HTML in the format "/html/body/div[1]/main/div[2]"

Mig.BR:
Yes, PascalDragon, that way worked perfectly.

I added another StringReplace to handle the line breaks <br /> (with space between <br and />).

Thanks for the tip to initialize the class variables with nil. I would only have run into that problem the first time an exception was raised.
I also liked the tip on converting HTML to XML. I will test that one too as it may come in handy in the future.
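
For readers following along, the nil-initialization tip can be sketched roughly as below. This is only an illustration, not the code discussed in the thread: `NormalizeLineBreaks` and `CountLines` are hypothetical helpers, and `<br/>` as the replacement target is an assumption, since the post does not say what the spaced form was replaced with.

```pascal
program NilInitSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

// Hypothetical helper: rewrites the spaced line-break form.
// Assumption: '<br/>' is the target; the thread does not state what was used.
function NormalizeLineBreaks(const AHtml: String): String;
begin
  Result := StringReplace(AHtml, '<br />', '<br/>', [rfReplaceAll, rfIgnoreCase]);
end;

// Hypothetical helper that only exists to show the try/finally pattern.
function CountLines(const AHtml: String): Integer;
var
  Source: TStringStream;
  Lines: TStringList;
begin
  // Local class variables are not zeroed automatically. Setting them to nil
  // before the try block keeps the Free calls safe if a constructor raises.
  Source := nil;
  Lines := nil;
  try
    Source := TStringStream.Create(NormalizeLineBreaks(AHtml));
    Lines := TStringList.Create;
    Lines.LoadFromStream(Source);
    Result := Lines.Count;
  finally
    // Free does nothing when the reference is nil.
    Lines.Free;
    Source.Free;
  end;
end;

begin
  WriteLn(NormalizeLineBreaks('first<br />second'));
  WriteLn(CountLines('one' + LineEnding + 'two'));
end.
```

Without the two nil assignments, an exception in the first constructor would leave `Lines` holding an arbitrary value, and `Lines.Free` in the finally block could crash. That is the failure that only shows up once something raises.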

Thanks again for your help.

Mig.BR:

--- Quote from: PascalDragon on July 04, 2022, 01:31:35 pm ---So to clarify: it's working correctly now for you?

--- End quote ---

It seems that my happiness was short-lived. I just came across my first XHTML and again I didn't get any return. I tried the most diverse modifications, but all without result.

Apparently EvaluateXPathExpression can't extract anything of this type. I even tried the XHTML unit but it seems incomplete.

Any idea how to use EvaluateXPathExpression to extract from XHTMLs or is it not implemented yet?

e.g.:

--- Code: HTML5 ---
<!doctype html><html lang=pt xmlns=http://www.w3.org/1999/xhtml><meta charset=utf-8><meta name=language content=pt-br>...
--- End code ---
It has no </html>, </body> or </head>.  How could I convert this to HTML or XML?
Validator.w3.org found 26 errors on this page. Is there a way around this or just brute force?

PascalDragon:

--- Quote from: Miguel.BR on July 07, 2022, 06:58:05 am ---
--- Quote from: PascalDragon on July 04, 2022, 01:31:35 pm ---So to clarify: it's working correctly now for you?

--- End quote ---

It seems that my happiness was short-lived. I just came across my first XHTML and again I didn't get any return. I tried the most diverse modifications, but all without result.

Apparently EvaluateXPathExpression can't extract anything of this type. I even tried the XHTML unit but it seems incomplete.

Any idea how to use EvaluateXPathExpression to extract from XHTMLs or is it not implemented yet?

e.g.:

--- Code: HTML5 ---
<!doctype html><html lang=pt xmlns=http://www.w3.org/1999/xhtml><meta charset=utf-8><meta name=language content=pt-br>...
--- End code ---
It has no </html>, </body> or </head>.  How could I convert this to HTML or XML?
Validator.w3.org found 26 errors on this page. Is there a way around this or just brute force?

--- End quote ---

If it isn't a valid XHTML then XHTML functions must not treat it as valid XHTML. Do you have a full example of such a file?

Mig.BR:

--- Quote from: PascalDragon on July 07, 2022, 01:50:17 pm ---If it isn't a valid XHTML then XHTML functions must not treat it as valid XHTML. Do you have a full example of such a file?

--- End quote ---

Here is an example of HTML.

Mig.BR:
I apologize, but lack of time prevented me from returning to the topic. After a quick analysis of the XHTML and some tests, I realized that just adding the </html> tag to the end of the text solves the problem. In this case the XPath has to be adapted as well: if the document has no <body> tag (or is missing other tags), those steps must also be removed from the path.
I did not analyze the code of ReadHTMLFile() from SAX_HTML in depth because I considered it a little complex, but I think it would be useful for it to verify whether the single <html> tag is actually closed.
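
A minimal sketch of the workaround described above, assuming the fcl-xml units `DOM`, `DOM_HTML`, `SAX_HTML` and `XPath`. The helpers `EnsureClosingHtml`, `DropBodyStep` and `EvaluateAsText` are hypothetical names introduced for illustration, the sample markup in the main block is made up, and whether a given page then matches still depends on how the SAX_HTML parser builds the tree.

```pascal
program XhtmlXPathSketch;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML, XPath;

// Hypothetical helper: appends '</html>' when the text has no closing tag.
function EnsureClosingHtml(const AHtml: String): String;
begin
  Result := AHtml;
  if Pos('</html>', LowerCase(AHtml)) = 0 then
    Result := Result + '</html>';
end;

// Hypothetical helper: drops the first '/body/' step from a path.
// It does not handle variants such as '/body[1]/' or a trailing '/body'.
function DropBodyStep(const APath: String): String;
begin
  Result := StringReplace(APath, '/body/', '/', []);
end;

// Parses the (patched) text and returns the XPath result as text.
function EvaluateAsText(const AHtml, APath: String): String;
var
  Source: TStringStream;
  Doc: THTMLDocument;
  Res: TXPathVariable;
begin
  Result := '';
  // Start from nil so the finally block is safe if anything raises.
  Source := nil;
  Doc := nil;
  Res := nil;
  try
    Source := TStringStream.Create(EnsureClosingHtml(AHtml));
    ReadHTMLFile(Doc, Source);
    Res := EvaluateXPathExpression(UTF8Decode(APath), Doc);
    Result := UTF8Encode(Res.AsText);
  finally
    Res.Free;
    Doc.Free;
    Source.Free;
  end;
end;

begin
  // Made-up input without <body> and without </html>.
  WriteLn(EvaluateAsText(
    '<html><div><main><div>a</div><div>b</div></main></div>',
    DropBodyStep('/html/body/div[1]/main/div[2]')));
end.
```

The closing tag is checked case-insensitively, so a page that already ends in `</HTML>` is left as it is.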
