Lazarus

Programming => General => Topic started by: totya on July 15, 2019, 09:18:19 pm

Title: Raw Data from TDOMNode
Post by: totya on July 15, 2019, 09:18:19 pm
Hi, I use laz2_dom and laz2_xmlread. Can I read the "raw" (meaning untouched) value of a node?

For example, take this xml element:
Quote
<B>TODO: </B>

When I read it (node.TextContent) I get
Quote
TODO


Thanks!
Title: Re: Raw Data from TDOMNode
Post by: wp on July 17, 2019, 11:28:45 am
You must recursively iterate through the child nodes of <ss:Data> along with their attributes and reconstruct the original string.

The following code can be used in the demo of https://forum.lazarus.freepascal.org/index.php/topic,46069.msg327080.html#msg327080.
Code: Pascal
procedure RebuildChildNodes(ANode: TDOMNode; var AText: String);
var
  nodeName: String;
  s: String;
  i: Integer;
begin
  if ANode = nil then
    exit;
  // Walk ANode and all of its following siblings.
  while ANode <> nil do begin
    nodeName := ANode.NodeName;
    if nodeName = '#text' then
      // Text node: append the text (character references are already decoded here).
      AText := AText + ANode.NodeValue
    else begin
      // Rebuild the opening tag together with its attributes.
      s := '';
      if ANode.Attributes <> nil then   // Attributes is nil for non-element nodes
        for i := 0 to ANode.Attributes.Length-1 do
          s := Format('%s %s="%s"', [s, ANode.Attributes.Item[i].NodeName, ANode.Attributes.Item[i].NodeValue]);
      AText := Format('%s<%s%s>', [AText, nodeName, s]);
      // Recurse into the children; the closing tag is only written when they produced content.
      s := '';
      RebuildChildNodes(ANode.FirstChild, s);
      if s <> '' then
        AText := Format('%s%s</%s>', [AText, s, nodeName]);
    end;
    ANode := ANode.NextSibling;
  end;
end;

[...]
            while data_node <> nil do begin
              nodeName := data_node.NodeName;
              if nodeName = 'ss:Data' then begin
                s := '';
                RebuildChildNodes(data_node.FirstChild, s);
                StringGrid1.Cells[c, r] := s;
                inc(c);
              end;
[...]
Title: Re: Raw Data from TDOMNode
Post by: totya on July 17, 2019, 05:54:44 pm
Hi master!

 :o

Seems to me it's working! Thank you!

But my modified code is needed too, because with the real files (I sent you one of them) the column order is wrong, so this is needed:

Code: Pascal
          if nodeName = 'Cell' then
          begin

            // #1 Read index if available...
            s := GetAttrValue(cell_node, 'ss:Index');
            if s <> '' then
              c := StrToInt(s);

            data_node := cell_node.FirstChild;

            // #2 if no child (without data), then increase index...
            if data_node = nil then
              Inc(c);

            while data_node <> nil do
            begin
              nodeName := data_node.NodeName;

But as I see it, this is needed for the test.xml file created by excelxmlwrite too...
Title: Re: Raw Data from TDOMNode
Post by: wp on July 17, 2019, 06:23:55 pm
Sorry, I don't fully understand. You mean the code following the comment "// #2..."? But I thought that the xml format adds an "ss:Index" attribute to the "Cell" node when cells to the left of the current one are empty. Nevertheless, I think that your code is not harmful and fixes an issue when the writing software does not use the "ss:Index" attribute - I'll add it to the "official" reader.
Title: Re: Raw Data from TDOMNode
Post by: totya on July 17, 2019, 06:38:43 pm
Quote
Sorry, I don't fully understand. You mean the code following the comment "// #2..."? But I thought that the xml format adds an "ss:Index" attribute to the "Cell" node when cells to the left of the current one are empty. Nevertheless, I think that your code is not harmful and fixes an issue when the writing software does not use the "ss:Index" attribute - I'll add it to the "official" reader.

As I wrote, you should look at the sample file that I sent to you.

#1 is certainly needed.
#2 is needed too, for empty cells(!), for example:

Code: XML
<Cell ss:StyleID="s55"/>
<Cell ss:StyleID="s55"/>

But as I see, the "simple" sWorkbookSource/sWorksheetGrid handles this situation...
Title: Re: Raw Data from TDOMNode
Post by: totya on July 17, 2019, 06:53:52 pm
Full example:

Code: XML
<Row ss:AutoFitHeight="0" ss:Height="200">
    <Cell ss:Index="2" ss:StyleID="s55"><Data ss:Type="String">String0</Data></Cell>
    <Cell ss:StyleID="s99"/>
    <Cell ss:StyleID="s99"/>
    <Cell ss:StyleID="s55"><Data ss:Type="String">String1</Data></Cell>
    <Cell ss:StyleID="s55"><Data ss:Type="String">String2</Data></Cell>
</Row>

As you see, the "empty" cells are important for the correct column position (index).
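The column rule discussed here (an "ss:Index" attribute jumps to that column, every other cell, empty or not, takes the next column) can be sketched as a small standalone program. This is only an illustration: TCellInfo, MakeCell and AssignColumns are hypothetical names that stand in for the real DOM nodes, and the columns are assumed to be 1-based like "ss:Index".

```pascal
program CellColumns;
{$mode objfpc}{$H+}

type
  // Hypothetical stand-in for one <Cell> node of a row.
  TCellInfo = record
    IndexAttr: Integer;  // value of ss:Index, or 0 when the attribute is absent
    Text: String;        // content of the <Data> child, '' for a style-only cell
  end;
  TIntArray = array of Integer;

function MakeCell(AIndexAttr: Integer; const AText: String): TCellInfo;
begin
  Result.IndexAttr := AIndexAttr;
  Result.Text := AText;
end;

// Returns the 1-based column of every cell of a row.
function AssignColumns(const ACells: array of TCellInfo): TIntArray;
var
  i, c: Integer;
begin
  Result := nil;
  SetLength(Result, Length(ACells));
  c := 1;
  for i := 0 to High(ACells) do
  begin
    // #1: an explicit ss:Index overrides the running column
    if ACells[i].IndexAttr > 0 then
      c := ACells[i].IndexAttr;
    Result[i] := c;
    // #2: every cell occupies one column, even when it has no <Data> child
    Inc(c);
  end;
end;

var
  row: array[0..4] of TCellInfo;
  cols: TIntArray;
  i: Integer;
begin
  // The row from the full example above
  row[0] := MakeCell(2, 'String0');
  row[1] := MakeCell(0, '');
  row[2] := MakeCell(0, '');
  row[3] := MakeCell(0, 'String1');
  row[4] := MakeCell(0, 'String2');
  cols := AssignColumns(row);
  for i := 0 to High(cols) do
    WriteLn('cell ', i, ' "', row[i].Text, '" -> column ', cols[i]);
end.
```

For this row the cells land in columns 2, 3, 4, 5 and 6. Without step #2 the two style-only cells would not advance the counter, and String1 and String2 would end up two columns too far to the left.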
Title: Re: Raw Data from TDOMNode
Post by: wp on July 17, 2019, 07:26:27 pm
Yes, this works, but an application which writes such files is not very clever because it can blow up the size of the xml file enormously. This is how Excel writes an xml file with two blank cells between two text cells. It always uses an "ss:Index" attribute to "jump" over a gap:

Code: XML
   <Row>
    <Cell ss:Index="3"><Data ss:Type="String">String0</Data></Cell>
    <Cell ss:Index="6"><Data ss:Type="String">String3</Data></Cell>
   </Row>

I re-checked the fpspreadsheet Excel2003/XML reader/writer - they handle empty cells correctly (the writer is a bit clumsy because it always writes an "ss:Index" attribute)
Title: Re: Raw Data from TDOMNode
Post by: totya on July 17, 2019, 07:39:01 pm
Quote
Yes, this works, but an application which writes such files is not very clever

I think it is clever. These cells are empty, but as you see, the file defines a style for them, so if these cells get a value later, the style is preserved. This is why your fps component handles these "empty" cells correctly.

Example: I create a red background for an empty Excel cell. The value is nothing, but the colour is information, isn't it?

So these cells are not really empty, because they contain the style.

I suspect you were thinking of this:

Code: XML
  <Row>
    <Cell ss:StyleID="s55"><Data ss:Type="String">String1</Data></Cell>
    <Cell></Cell>
    <Cell></Cell>
    <Cell ss:StyleID="s55"><Data ss:Type="String">String1</Data></Cell>
  </Row>

... but hopefully I won't see anything like this. :)
Title: Re: [SOLVED by wp master] Raw Data from TDOMNode
Post by: totya on July 17, 2019, 11:40:58 pm
Hi master! :)

As I see, a simple <Data> </Data> needs rebuilding too if it contains a normal xml character reference, for example:

Quote
&#10;

... but no success yet with your new procedure.
Title: Re: [SOLVED by wp master] Raw Data from TDOMNode
Post by: wp on July 18, 2019, 12:46:32 am
Look at what is happening in the xml reader of fpspreadsheet. It is pretty complete now and works rather well. This is in unit xlsxml.pas, method TsExcelXMLReader.ReadCell.
Title: Re: [SOLVED by wp master] Raw Data from TDOMNode
Post by: totya on July 18, 2019, 06:39:32 am
Quote
Look at what is happening in the xml reader of fpspreadsheet. It is pretty complete now and works rather well. This is in unit xlsxml.pas, method TsExcelXMLReader.ReadCell.

Hi master! :)

It doesn't work. Try this:
Quote
<Cell><Data ss:Type="String">Sample&#10;Text</Data></Cell>

Code: Pascal
          if nodeName = 'ss:Data' then begin
            txt := '';
            RebuildChildNodes(node, txt);
            HTMLToRichText(FWorkbook, font, txt, s, cell^.RichTextParams, 'html:');
          end;

I will look at it again after work :)
Title: Re: [SOLVED by wp master] Raw Data from TDOMNode
Post by: wp on July 18, 2019, 09:14:23 am
Sorry, your messages are a bit cryptic. What does not work? Is the '&#10;' kept in the TextContent? For me everything is ok. What is your Lazarus/fpc version? Are you working with fpspreadsheet or with your own reader? In the latter case, post some compilable code.
Title: Re: [SOLVED by wp master] Raw Data from TDOMNode
Post by: totya on July 18, 2019, 05:45:56 pm
Quote
Sorry, your messages are a bit cryptic. What does not work? Is the '&#10;' kept in the TextContent? For me everything is ok. What is your Lazarus/fpc version? Are you working with fpspreadsheet or with your own reader? In the latter case, post some compilable code.

Hi master! :)

I'm sorry for the misunderstanding. The topic name is: Raw Data from TDOMNode

So, I want to read the original (untouched) data values from the xml.

So, when this Data element is in the xml:

Quote
<Data ss:Type="String">AA &#10; BB</Data>

When I read it, I want to get exactly this value:
Quote
AA &#10; BB

Sample code attached.

The result:

Quote
GetNodeValue(data_node): "AA
 BB"
data_node.TextContent: "AA
 BB"
After rebuild: String<Data ss:Type="String">AA
 BB</Data>
GetNodeValue(data_node): "AA
 BB"
data_node.TextContent: "AA
 BB"

But I wanted the original value (raw data):

Quote
AA &#10; BB

Thank you :)

Lazarus version : fixes 2.0 branch, fpc version: fixes 3.2 branch.
Title: Re: Raw Data from TDOMNode
Post by: wp on July 18, 2019, 07:07:58 pm
Now I understand: fpspreadsheet must remove the special codes, and this works. But you want to keep them, and this does not work.

Of course you can pass the extracted string to the function UTF8TextToXMLText of unit fpsxmlcommon - it just replaces the line breaks and other special characters by the xml equivalents (set "ProcessLineEndings" to true in order to replace #10 by '&#10;').

Kind of cumbersome though: First the xml reader removes them, and UTF8TextToXMLText brings them back in... It would be better to force the xml reader to keep them in the first place. I don't know, however, how to do this.
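The "bring them back in" step can be sketched without fpspreadsheet as follows. EscapeXmlText is a hypothetical helper written for this illustration; it is not the implementation of UTF8TextToXMLText in fpsxmlcommon, only a guess at the same idea with a comparable "process line endings" switch.

```pascal
program EscapeDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: replaces the characters that are critical in xml text
// by their references. With AProcessLineEndings = true, #10 and #13 are
// written as numeric character references as well.
function EscapeXmlText(const S: String; AProcessLineEndings: Boolean): String;
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    case S[i] of
      '&': Result := Result + '&amp;';
      '<': Result := Result + '&lt;';
      '>': Result := Result + '&gt;';
      '"': Result := Result + '&quot;';
      #10:
        begin
          if AProcessLineEndings then
            Result := Result + '&#10;'
          else
            Result := Result + S[i];
        end;
      #13:
        begin
          if AProcessLineEndings then
            Result := Result + '&#13;'
          else
            Result := Result + S[i];
        end;
    else
      // Everything else, including UTF-8 multi-byte sequences, is copied as is
      Result := Result + S[i];
    end;
end;

begin
  // The decoded value as delivered by TextContent: 'AA ', a line feed, ' BB'
  WriteLn(EscapeXmlText('AA '#10' BB', True));   // prints: AA &#10; BB
end.
```

Note that this cannot tell whether a line feed was written as '&#10;' or as a literal line break in the original file. After parsing both look the same, so re-escaping restores an equivalent text, not necessarily the byte-identical one.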

But what exactly do you want to achieve? Maybe laz2_dom and laz2_xmlread are not the correct units for your purpose.

The strange output of the RebuildChildNodes procedure is due to the fact that you do not initialize the string parameter (s) passed to this function. RebuildChildNodes is a recursive function and always adds the node name, node attributes, and node content to this string which gets longer with every recursion level. You simply must set s := '' before you call RebuildChildNodes:
Code: Pascal
              if nodeName = 'Data' then
              begin
                s := GetAttrValue(data_node, 'ss:Type');
                if (s = 'String') or (s = 'Number') then
                begin
                  WriteLN(Format('GetNodeValue(data_node): "%s"', [GetNodeValue(data_node)]));
                  WriteLN(Format('data_node.TextContent: "%s"', [data_node.TextContent]));
                  s := '';             // <--------------------- ADDED -----------------<
                  RebuildChildNodes(data_node, s);
                  WriteLN('After rebuild: '+s);
                  WriteLN(Format('GetNodeValue(data_node): "%s"', [GetNodeValue(data_node)]));
                  WriteLN(Format('data_node.TextContent: "%s"', [data_node.TextContent]));

                  ReadLN;
                end
                else
                  WriteLN('');
              end;
Title: Re: Raw Data from TDOMNode
Post by: totya on July 18, 2019, 08:05:24 pm
Quote
But what exactly do you want to achieve? Maybe laz2_dom and laz2_xmlread are not the correct units for your purpose.

I just want to read (Office-created) xml files with their original/untouched values, then modify/select/copy values, and then write them back or write them to a new file. These "special" codes must stay in the text. Besides, the critical characters in xml must be replaced by entity references, see:

https://www.w3schools.com/xml/xml_syntax.asp
See the Entity References section.
Unfortunately these "chars" are converted too when I read the values...

As I said, my own interpreter is under development... but thanks for all the help and the ideas, master! :) And the great fps component can now handle one more format :)
Title: Re: Raw Data from TDOMNode
Post by: totya on July 19, 2019, 02:36:55 pm
I was thinking... and I suspect that "laz2_XMLWrite" needs to handle these special chars when saving.

By the way, Excel 2003 works correctly. If I create a cell:

AA + alt-enter + BB

then I find this in the created xml:

Quote
<Cell ss:StyleID="s21"><Data ss:Type="String">A&#10;BB</Data></Cell>