Author Topic: SAX_HTML GetElementById not work  (Read 2765 times)

wytwyt02

  • Jr. Member
  • **
  • Posts: 83
SAX_HTML GetElementById not work
« on: June 06, 2020, 07:44:42 am »
For example, I have the following HTML:

Code: Text
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document</title>
</head>
<body>
    <div>
        <p id="test">test paragraph</p>
    </div>
</body>
</html>

and I parse it with Pascal:

Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  doc: THTMLDocument;
  PElement: TDOMElement;
begin
  ReadHTMLFile(doc, GetCurrentDir + '\index.html');
  PElement := doc.GetElementById('test'); // PElement is always nil
end;

But PElement is always nil, so I cannot get the element by its id. If I use doc.GetElementsByTagName instead, it works:

Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  doc: THTMLDocument;
  InnerText: DOMString;
  PElement: TDOMNodeList;
begin
  ReadHTMLFile(doc, GetCurrentDir + '\index.html');
  PElement := doc.GetElementsByTagName('p');
  InnerText := PElement[0].TextContent;
  DebugLn(InnerText); // prints 'test paragraph' to the console, so this works
end;
« Last Edit: June 06, 2020, 07:46:49 am by wytwyt02 »

wytwyt02

  • Jr. Member
  • **
  • Posts: 83
Re: SAX_HTML GetElementById not work
« Reply #1 on: June 06, 2020, 10:03:20 am »
I can print the attribute name and value with:

Code: Pascal
procedure TForm1.FormCreate(Sender: TObject);
var
  doc: THTMLDocument;
  PElement: TDOMNodeList;
  attrs: TDOMNamedNodeMap;
  i: Integer;
begin
  ReadHTMLFile(doc, GetCurrentDir + '\index.html');
  PElement := doc.GetElementsByTagName('p');
  attrs := PElement[0].Attributes;
  for i := 0 to Pred(attrs.Length) do
  begin
    DebugLn(attrs[i].NodeName + ':' + attrs[i].NodeValue); // prints id:test
  end;
end;

But I still do not know how to get an element by its id.

dsiders

  • Hero Member
  • *****
  • Posts: 1052
Re: SAX_HTML GetElementById not work
« Reply #2 on: June 06, 2020, 11:50:59 am »
Short answer:
You can't.

Longer answer:
IDs are maintained in a THashList that is never assigned in the ancestor (TDOMDocument) or any of its descendants. GetElementById checks the hash list and, when it is not assigned, always returns nil.

Code: Pascal
function TDOMDocument.GetElementById(const ElementID: DOMString): TDOMElement;
begin
  Result := nil;
  if Assigned(FIDList) then
    Result := TDOMElement(FIDList.Get(DOMPChar(ElementID), Length(ElementID)));
end;
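
Since the id attribute is still present on the parsed node (Reply #1 prints id:test), one possible workaround is to walk the tree and compare the attribute yourself. The following is only a sketch under assumptions: FindElementById is a hypothetical helper, not part of the FPC DOM units, and it assumes the attribute name is stored in lowercase as 'id' and that index.html is in the current directory.

```pascal
program FindById;

{$mode objfpc}{$H+}

uses
  SysUtils, DOM, DOM_HTML, SAX_HTML;

// Hypothetical helper (not part of FPC): depth-first search for the first
// element whose 'id' attribute equals AId. Returns nil if nothing matches.
function FindElementById(Node: TDOMNode; const AId: DOMString): TDOMElement;
var
  Child: TDOMNode;
begin
  Result := nil;
  if (Node = nil) or (AId = '') then
    Exit;
  // GetAttribute returns an empty string when the attribute is missing.
  if (Node.NodeType = ELEMENT_NODE) and
     (TDOMElement(Node).GetAttribute('id') = AId) then
    Exit(TDOMElement(Node));
  // Recurse into the children until a match is found.
  Child := Node.FirstChild;
  while (Child <> nil) and (Result = nil) do
  begin
    Result := FindElementById(Child, AId);
    Child := Child.NextSibling;
  end;
end;

var
  Doc: THTMLDocument;
  Elem: TDOMElement;
begin
  Doc := nil;
  ReadHTMLFile(Doc, 'index.html'); // assumed file name and location
  try
    Elem := FindElementById(Doc.DocumentElement, 'test');
    if Assigned(Elem) then
      WriteLn(Elem.TextContent)
    else
      WriteLn('not found');
  finally
    Doc.Free;
  end;
end.
```

The search is linear in the number of nodes on every call, so for a crawler that looks up many ids in one document it may be worth collecting the ids into a map once instead.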
Preview Lazarus 3.99 documentation at: https://dsiders.gitlab.io/lazdocsnext

Jurassic Pork

  • Hero Member
  • *****
  • Posts: 1228
Re: SAX_HTML GetElementById not work
« Reply #3 on: June 06, 2020, 02:20:16 pm »
Hello,
you can also use XPath to get an HTML element by id with the internettools package:

Code: Pascal
program xPathTest;
uses Classes, SysUtils, simpleinternet;
var
  extFile: TStringList;
begin
  extFile := TStringList.Create();
  extFile.LoadFromFile(GetCurrentDir + '\index.html');
  writeln(process(extFile.Text, '//p[@id="test"]').toString);
  extFile.Free;
end.

Friendly J.P
« Last Edit: June 06, 2020, 02:23:59 pm by Jurassic Pork »
Jurassic computer : Sinclair ZX81 - Zilog Z80A à 3,25 MHz - RAM 1 Ko - ROM 8 Ko

wytwyt02

  • Jr. Member
  • **
  • Posts: 83
Re: SAX_HTML GetElementById not work
« Reply #4 on: June 06, 2020, 03:09:30 pm »
I tried internettools. It works fine for reading a document, but writing is difficult (for example, finding a DOM node and changing its properties).

wytwyt02

  • Jr. Member
  • **
  • Posts: 83
Re: SAX_HTML GetElementById not work
« Reply #5 on: June 06, 2020, 03:14:09 pm »
Oh my god :o. I do not want to use regex, and I do not like internettools because of its strange API. This feature is really necessary for a website crawler.

dsiders

  • Hero Member
  • *****
  • Posts: 1052
Re: SAX_HTML GetElementById not work
« Reply #6 on: June 06, 2020, 03:44:18 pm »
Your example specifically used THTMLDocument from fpc.

You could use TDOMDocument from the Lazarus LazUtils package. It implements GetElementByID, but it requires well-formed XML content, which HTML5 (as in your example) is not.

Choices, choices....

wytwyt02

  • Jr. Member
  • **
  • Posts: 83
Re: SAX_HTML GetElementById not work
« Reply #7 on: June 07, 2020, 04:31:18 am »
What happens if the HTML is not well-formed when I use TDOMDocument?

dsiders

  • Hero Member
  • *****
  • Posts: 1052
Re: SAX_HTML GetElementById not work
« Reply #8 on: June 07, 2020, 05:14:55 am »
Well, I'm assuming you'll actually use TXMLDocument, and call ReadXMLFile or TXMLReader to load the file. In that case, an EXMLReadError exception is raised.
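
The failure mode described above can be sketched roughly as follows. This is a minimal example using the FPC DOM and XMLRead units (the LazUtils equivalents would be laz2_DOM and laz2_XMLRead), and the file name index.html is an assumption. It only demonstrates catching the parse error and does not show GetElementById succeeding.

```pascal
program LoadStrictXml;

{$mode objfpc}{$H+}

uses
  SysUtils, DOM, XMLRead;

var
  Doc: TXMLDocument;
begin
  Doc := nil;
  try
    try
      // Raises EXMLReadError if the input is not well-formed XML,
      // e.g. because of an unclosed <meta charset="UTF-8"> tag.
      ReadXMLFile(Doc, 'index.html'); // assumed file name
      WriteLn('Parsed as XML; root element: ', Doc.DocumentElement.NodeName);
    except
      on E: EXMLReadError do
        WriteLn('Not well-formed XML: ', E.Message);
    end;
  finally
    // Free is safe to call on a nil reference.
    Doc.Free;
  end;
end.
```

With the HTML from the first post, the unclosed meta tag would be expected to trigger the exception branch, so a crawler taking this route would need a fallback for real-world pages.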

There is a lot of good info on the Wiki at: https://wiki.lazarus.freepascal.org/XML_Tutorial