Author Topic: Reading the content of compressed MS Office and LibreOffice files  (Read 262 times)

Gizmo

  • Hero Member
  • *****
  • Posts: 702
Forgive me, as I know this has been asked before, but I just can't seem to find the posts now. I asked it myself some years ago.

Anyway, I need to be able to open and read (not write) modern compressed MS Word, Excel, and similar files that have docx, odt, and similar extensions, and load the readable file content into a buffer. I don't need all the other internal data, just the main body of the file that the user would see if they opened it in MS Word or LibreOffice Writer.

I have read up on fpVectorial (https://wiki.lazarus.freepascal.org/fpvectorial_-_Text_Document_Support). Using the Hello World example, it does indeed easily CREATE a docx or odt file. But I need to READ an existing one, without making outside calls to the Internet (as the machines my program will run on can't dial out). The author of said library has not been active since 2018, so I don't think he will answer.

I have got as far as this :

Code: Pascal
uses
  ...fpvectorialpkg, fpvectorial...

procedure TForm1.Button1Click(Sender: TObject);
Var
  Document: TvVectorialDocument;
  Page: TvTextPageSequence;
  Paragraph: TvParagraph;
  AData: TvVectorialFormat;
Begin
  Document := TvVectorialDocument.Create;
  Try
    Document.ReadFromFile('Hello_World.docx', AData);  // WHAT NOW?
  Finally
    Document.Free;
  End;
end;

I am assuming the content of Hello_World.docx ends up in 'AData', but how do you then read or examine what is in it? It is of type 'TvVectorialFormat', and I don't understand how you query that. All the examples seem to be about creating and writing such files, not about reading existing ones. If I had a 20-page Word document with 20,000 words in it, I would need to load all of that into some kind of buffer for parsing by my own functions.

Or is there a better way to read such files? I did read this, but as I said, it requires an Internet connection, and I think it is specific to OpenOffice rather than covering all modern word processing file formats (https://www.freepascal.org/~michael/articles/openoffice1/openoffice.pdf)

Many thanks
« Last Edit: November 21, 2019, 06:13:12 pm by Gizmo »
Lazarus 2.0.4 and fpc 3.0.4 - Linux Mint 19 LTS, Windows 10 64 and Mac OSX Catalina
Useful Page to remember : http://wiki.freepascal.org/Cross_compiling#From_Linux_x64_to_Linux_i386

winni

  • Hero Member
  • *****
  • Posts: 1610

Gizmo

Re: Reading the content of compressed MS Office and LibreOffice files
« Reply #2 on: November 21, 2019, 07:00:21 pm »
Ah yes, that's the one I wrote, but I can't believe it was back in 2016! I thought it was last year! There are one or two others, written by fellow users, that I recall reading over the last few weeks as well.

But that thread, whilst replied to, still doesn't really solve the issue. The basic solution stated there is to effectively rename all such files as zips, then use a zip traversal procedure to find the inner "document.xml" file (in the case of MS Word), open that, and then use XML traversal. But I don't really want my program renaming files and then renaming them back again when it is finished. And even if I did, it seems like a bit of a "hack" to achieve what must be a fairly common need these days. Writing tools that can create and read Office files (Microsoft, LibreOffice etc.) must be a common requirement. And, as this post describes, one such library seems to exist (fpVectorial), but I just can't currently see how it is used to "get" the content that would be listed in "document.xml".
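
For what it is worth, the zip-plus-XML route does not seem to need any renaming: TUnZipper from FPC's zipper unit opens whatever file name it is given, whatever the extension, and it works entirely offline. Below is a minimal sketch under stated assumptions, not a definitive implementation. TEntryReader and CollectText are hypothetical names made up for illustration. The code assumes the main body lives in the entry 'word/document.xml' and that paragraphs and text runs use the usual 'w:p' and 'w:t' element names that Word writes; it matches those names literally rather than resolving XML namespaces.

```pascal
program ReadDocxText;

{$mode objfpc}{$H+}

uses
  Classes, SysUtils, zipper, DOM, XMLRead;

type
  // Hypothetical helper: pulls one archive entry into memory, nothing goes to disk
  TEntryReader = class
  private
    FData: TMemoryStream;
    procedure DoCreateStream(Sender: TObject; var AStream: TStream;
      AItem: TFullZipFileEntry);
    procedure DoDoneStream(Sender: TObject; var AStream: TStream;
      AItem: TFullZipFileEntry);
  public
    constructor Create;
    destructor Destroy; override;
    procedure Load(const AArchive, AEntry: string);
    property Data: TMemoryStream read FData;
  end;

constructor TEntryReader.Create;
begin
  inherited Create;
  FData := TMemoryStream.Create;
end;

destructor TEntryReader.Destroy;
begin
  FData.Free;
  inherited Destroy;
end;

procedure TEntryReader.DoCreateStream(Sender: TObject; var AStream: TStream;
  AItem: TFullZipFileEntry);
begin
  // Hand the unzipper a memory stream instead of letting it create a file
  AStream := TMemoryStream.Create;
end;

procedure TEntryReader.DoDoneStream(Sender: TObject; var AStream: TStream;
  AItem: TFullZipFileEntry);
begin
  // Keep a copy of the decompressed bytes, then release the temporary stream
  AStream.Position := 0;
  FData.LoadFromStream(AStream);
  FreeAndNil(AStream);
end;

procedure TEntryReader.Load(const AArchive, AEntry: string);
var
  UnZip: TUnZipper;
  Names: TStringList;
begin
  FData.Clear;
  UnZip := TUnZipper.Create;
  Names := TStringList.Create;
  try
    Names.Add(AEntry);                  // extract only this one entry
    UnZip.FileName := AArchive;         // any extension is fine, no renaming
    UnZip.OnCreateStream := @DoCreateStream;
    UnZip.OnDoneStream := @DoDoneStream;
    UnZip.UnZipFiles(Names);
  finally
    Names.Free;
    UnZip.Free;
  end;
end;

// Walks the DOM tree, appending each w:t run and a line break after each w:p
procedure CollectText(ANode: TDOMNode; var AText: UTF8String);
var
  Child: TDOMNode;
begin
  Child := ANode.FirstChild;
  while Assigned(Child) do
  begin
    if Child.NodeName = 'w:t' then
      AText := AText + UTF8Encode(Child.TextContent)
    else
    begin
      CollectText(Child, AText);
      if Child.NodeName = 'w:p' then
        AText := AText + LineEnding;
    end;
    Child := Child.NextSibling;
  end;
end;

var
  Reader: TEntryReader;
  Doc: TXMLDocument;
  Body: UTF8String;
begin
  Reader := TEntryReader.Create;
  try
    Reader.Load(ParamStr(1), 'word/document.xml');
    if Reader.Data.Size = 0 then
      WriteLn('word/document.xml not found in ', ParamStr(1))
    else
    begin
      Reader.Data.Position := 0;
      ReadXMLFile(Doc, Reader.Data);    // parse straight from memory
      try
        Body := '';
        CollectText(Doc.DocumentElement, Body);
        WriteLn(Body);                  // Body is the buffer to parse further
      finally
        Doc.Free;
      end;
    end;
  finally
    Reader.Free;
  end;
end.
```

Run it as ReadDocxText Hello_World.docx. An odt file is also a zip archive, with the body in 'content.xml' and paragraphs in 'text:p' elements, so the same idea should carry over, though CollectText would need adjusting because the text there sits directly in text nodes rather than in a dedicated run element.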