Recent

Author Topic: Using fpspreadsheet to write huge xlsx files  (Read 39952 times)

wp

  • Hero Member
  • *****
  • Posts: 13433
Using fpspreadsheet to write huge xlsx files
« on: July 10, 2014, 11:19:54 pm »
There has been discussion that fpspreadsheet is not suited for large data files. Since the entire spreadsheet grid is held in memory all the time this is certainly true.

In order to overcome this limitation, a "virtual mode" has been introduced in svn revision 3306 of the package. "Virtual", in this context, is to be understood in the same way as in "virtual listbox" or "virtual treeview". It means that writing of a spreadsheet file does not require permanently available data stored in a worksheet, but gets the data temporarily by means of an event. Therefore, in principle, the file written can be of any size. A limitation, of course, is given by the temporary memory streams used for collecting data. But there is an option to avoid this limitation by using file streams.

The main application, I guess, is exporting large database files to spreadsheet formats.

Here is a short "getting started":
  • Activate "woVirtualMode" of the worksheet's new "WritingOptions".
  • If you expect "many" rows overflowing the available memory activate also the option "woSaveMemory" which triggers usage of TFileStream for temporary storage.
  • Using the worksheet's properties "VirtualRowCount" and "VirtualColCount" define how many data will be saved.
  • Before calling the "WriteToFile" method of the spreadsheet open the dataset to be exported.
  • Write an event handler for the event "OnNeedCellData" of the worksheet. This event is called for every cell within the range defined by VirtualRowCount and VirtualColCount and expects the value to be stored for this cell. The loop progresses first along the rows, then along the columns. In database terms, the columns correspond the the fields of a dataset, the rows correspond to the records. Please note that you are responsible for advancing the dataset cursor if the last field of a record (the last column of a row) has been handled.
  • Call "WriteToFile"
  • Of course, you must be aware that writing of huge files takes its time. You can use the OnNeedCellData event to provide a progress display.
There is a little (non-database) demo ("test_virtualmode") in the folder "examples/other" of the fpspreadsheet installtion.

In my tests, the Windows task manager does not show any increase of memory usage in this mode.

Currently this feature is only available for xlsx files, but the writers of other spreadsheet format will be upgraded shortly.


Mike.Cornflake

  • Hero Member
  • *****
  • Posts: 1269
Re: Using fpspreadsheet to write huge xlsx files
« Reply #1 on: July 11, 2014, 08:14:07 am »
I am very impressed, both with your response time and your dedication to fpspreadsheet.

Quick question (for interest only):  I presume it's only the content that you're immediately writing out?  I assume styles would still be cached and written out last?

Another question (again for interest only):  Do you handle exporting images in fpspreadsheet?

Actually both of these questions have a hidden agenda :-):  When I get time (likely Sept) I now fully intend to rewrite the docx writer in line with the lessons you've learned over the past few days.  Everything you've gone through is applicable there as well.
Lazarus Trunk/FPC latest fixes on Windows 11
  I'm getting old and stale.  Slowly getting used to git, I'll get there...

wp

  • Hero Member
  • *****
  • Posts: 13433
Re: Using fpspreadsheet to write huge xlsx files
« Reply #2 on: July 11, 2014, 10:15:51 am »
Quote
I presume it's only the content that you're immediately writing out?  I assume styles would still be cached and written out last?
The present version only writes the content. Styles, in spreadsheet files, are written first and referred to by an index when the cell is written. If that feature should become important I think I could use the styles of some template cells in the worksheet. But this is not on my list so far.

Quote
Do you handle exporting images in fpspreadsheet?
No.

Mike.Cornflake

  • Hero Member
  • *****
  • Posts: 1269
Re: Using fpspreadsheet to write huge xlsx files
« Reply #3 on: July 11, 2014, 11:52:26 am »
Styles, in spreadsheet files, are written first and referred to by an index when the cell is written. If that feature should become important I think I could use the styles of some template cells in the worksheet. But this is not on my list so far.

Many thanks for the info. 
Lazarus Trunk/FPC latest fixes on Windows 11
  I'm getting old and stale.  Slowly getting used to git, I'll get there...

rvk

  • Hero Member
  • *****
  • Posts: 6953
Re: Using fpspreadsheet to write huge xlsx files
« Reply #4 on: July 11, 2014, 12:55:02 pm »
Wow, you've been busy  :D Great work.

So using woVirtualMode makes TsWorkbook just a passthrough for the TsCustomSpreadWriter.
And using woSaveMemory makes the TsCustomSpreadWriter write to temporary files (instead of memory).
It works perfectly. No memory usage at all with 20.000x100 cells.

When you're going to implement this in sfOpenDocument and sfExcel8 the OnNeedCellData would also need a formatting parameter. The .GetCell and ^.UsedFormattingFields wouldn't work anymore because the cells are already written to disc. But if there is a parameter (for OnNeedCellData) with a complete format-structure this can be used to fill styles.xml directly.

My first test:
Code: [Select]
Running: Building TsWorkbook and Writing sfOOXML
  All strings with Windows 7 pro and SSD drive
----------
Rows x Cols WritingOptions                Build  Write
10.000x100  [                           ]   0,7    5,0
10.000x100  [woVirtualMode              ]   0,0    7,2
10.000x100  [               woSaveMemory]   0,6   11,9
10.000x100  [woVirtualMode, woSaveMemory]   0,0   12,9
----------
20.000x100  [                           ]   1,3   10,1
20.000x100  [woVirtualMode              ]   0,0   14,6
20.000x100  [               woSaveMemory]   1,3   23,6
20.000x100  [woVirtualMode, woSaveMemory]   0,0   25,6
----------
30.000x100  [                           ]   2,0   15,3
30.000x100  [woVirtualMode              ]   0,0   21,9
30.000x100  [               woSaveMemory]   2,0   35,1
30.000x100  [woVirtualMode, woSaveMemory]   0,0   38,1
----------
40.000x100  [                           ]   2,7   20,5
40.000x100  [woVirtualMode              ]   0,0   29,8
40.000x100  [               woSaveMemory]   2,7   46,0
40.000x100  [woVirtualMode, woSaveMemory]   0,0   50,6
Ready

At this time i keep getting an error during more testing with a temp-file creation in zipper but couldn't yet pinpoint why.
<< Unable to create file ".tmp". >>  Somewhere the FFilename in zipper is empty and it throws an error. I'll report back if i found something.

I'm also looking for a memory leak. Maybe they're connected  %)

rvk

  • Hero Member
  • *****
  • Posts: 6953
Re: Using fpspreadsheet to write huge xlsx files
« Reply #5 on: July 11, 2014, 02:20:32 pm »
I found two "problems" with the new xlsxooxml.pas.

1) With the xlsxooxml.pas/fpszipper.pp combination.

In fpszipper you have a constant:
Code: [Select]
   DefaultInMemSize = 256*1024; { Files larger than 256k are processed on disk   }
(It's assigned to FInMemSize in .Create)
When a file is larger than 256KB it is processed on disc.
But according to this code (starting at line 1392, fspzipper.pp rev.3306):
Code: [Select]
  Try
    StartZipFile(Item);
    If (FInfile.Size<=FInMemSize) then
      ZipStream:=TMemoryStream.Create
    else
      begin
      TmpFileName:=ChangeFileExt(FFileName,'.tmp');
      ZipStream:=TFileStream.Create(TmpFileName,fmCreate);
      end;
it uses FFileName for a temporary filename.

But in xlsxooxml.pas there isn't a Filename set for TZip in WriteToStream. So creating large spreadsheet will always crash.
(I added FZip.FileName := 'temp___'; at line 618 and it was solved)

Actually (fps)zipper should check if FileName is set and if it's not it should assign something to it before continuing (or give a proper exception).

2) I found that the files created with woVirtualMode are a few KB larger.
It seems that sheet1.xml are different for both methods.
I see in one sheet1.xml you do a <row r="19" spans="1:10000"> and in the other <row r="19" spans="1:100">
In xlsxooxml.pas you set it to Rowcount for woVirtualMode and to Colcount for [].
Typo ???  ;D


My test-run:
Code: [Select]
Running: Building TsWorkbook and Writing sfOOXML
  All strings with Windows 7 pro and SSD drive
----------
Rows x Cols WritingOptions                Build  Write
10.000x100  [                           ]   1,6    5,1
10.000x100  [woVirtualMode              ]   0,0    7,2
10.000x100  [               woSaveMemory]   1,6   12,1
10.000x100  [woVirtualMode, woSaveMemory]   0,0   15,5
----------
20.000x100  [                           ]   3,3   10,5
20.000x100  [woVirtualMode              ]   0,0   14,6
20.000x100  [               woSaveMemory]   3,2   23,9
20.000x100  [woVirtualMode, woSaveMemory]   0,0   30,0
----------
30.000x100  [                           ]   4,9   15,6
30.000x100  [woVirtualMode              ]   0,0   21,7
30.000x100  [               woSaveMemory]   4,8   35,3
30.000x100  [woVirtualMode, woSaveMemory]   0,0   44,6
----------
40.000x100  [                           ]   6,5   20,8
40.000x100  [woVirtualMode              ]   0,0   29,0
40.000x100  [               woSaveMemory]   6,5   46,5
40.000x100  [woVirtualMode, woSaveMemory]   0,0   59,0
----------
Ready

(should i learn how to write bugreports in bugtracker :D)
« Last Edit: July 11, 2014, 02:24:50 pm by rvk »

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Using fpspreadsheet to write huge xlsx files
« Reply #6 on: July 11, 2014, 02:39:10 pm »
(should i learn how to write bugreports in bugtracker :D)
AFAIC, yes - looks like a good idea if you want to patch tzipper ;)
(Fpspreadsheet: up to wp - he's the one working on it these days)
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12721
  • FPC developer.
Re: Using fpspreadsheet to write huge xlsx files
« Reply #7 on: July 11, 2014, 02:46:00 pm »
(should i learn how to write bugreports in bugtracker :D)

As Reinier says, in general yes. Now no, since it is having problems :-)

BigChimp

  • Hero Member
  • *****
  • Posts: 5740
  • Add to the wiki - it's free ;)
    • FPCUp, PaperTiger scanning and other open source projects
Re: Using fpspreadsheet to write huge xlsx files
« Reply #8 on: July 11, 2014, 02:46:56 pm »
Bugtracker works fine again here...
Want quicker answers to your questions? Read http://wiki.lazarus.freepascal.org/Lazarus_Faq#What_is_the_correct_way_to_ask_questions_in_the_forum.3F

Open source including papertiger OCR/PDF scanning:
https://bitbucket.org/reiniero

Lazarus trunk+FPC trunk x86, Windows x64 unless otherwise specified

wp

  • Hero Member
  • *****
  • Posts: 13433
Re: Using fpspreadsheet to write huge xlsx files
« Reply #9 on: July 11, 2014, 02:52:51 pm »
Quote
the OnNeedCellData would also need a formatting parameter
So you're saying that you want to format the data in the spreadsheet as well? As I wrote in the other posting, my idea is to provide template cells in the worksheet for each style requested and pass their style index to an extended OnNeedCellData event.

However, it is a bit difficult to find out this index from outside the writer. The writing process of fpspreadsheet is pretty much encapsulated, there are only a few hooks into the outside world.

It would be another possibility to pass the row/col coordinates of the template cell to the OnNeedCellData event, but then the writer would have to seek for the style index again and again for each of the millions of cells. Since the format list will be short I guess this would not cost too much, but it's certainly better to sit down and look for a better solution.

Quote
i keep getting an error during more testing with a temp-file creation in zipper
I am seeing such errors also in BigChimp's unit tests which have a lot of temp file access without the zipper being involved. Therefore, I'd blame the fast consecutive deletion and recreation of temporary files for this issue instead. No idea on the mechanism, though.

I am also observing that sometimes the temporary files are not deleted (they are in %(APPDATA)\Local\Temp and have a name beginning with "fps" and 2-3 letters indicating the data type, like STY for "styles"). The files are deleted by calling "DeleteFile" which exits without an error, but the files are still there. ?!

Quote
Bugtracker ... up to wp
Reports on bugs are always welcome...

rvk

  • Hero Member
  • *****
  • Posts: 6953
Re: Using fpspreadsheet to write huge xlsx files
« Reply #10 on: July 11, 2014, 05:24:41 pm »
Quote
the OnNeedCellData would also need a formatting parameter
So you're saying that you want to format the data in the spreadsheet as well?
I'll take it these changes are going to be implemented for fpsopendocument.pas as well? And at this moment there already is the possibility to do some formatting like stated in the examples:
Code: [Select]
  MyWorksheet.WriteUTF8Text(3, 3, '[N,E]');// D4
  MyCell := MyWorksheet.GetCell(3, 3);
  MyCell^.Border := [cbNorth, cbEast];
  MyCell^.UsedFormattingFields := [uffBorder];
If you're going to implement woVirtualMode there, i imagine this would not be possible anymore (because everything is already passed through to the writer-class to a xml-stream).
(And i don't expect you're going to implement the GetCell-function when woVirtualMode is used, are you?)

To keep these formatting options in fpsopendocument.pas you could just add a parameter (besides AData) for format (i.e. AFormat) with a structure with formatting options to OnNeedCellData. You can apply them directly in fpsopendocument.pas to the cell being written (or at least build those format_lists and point the cell being written to that index). (or is this not possible?). When this structure is in place it could be later implemented for xlsxooxml.pas too (but that's not on your list yet).

I'm not sure what you meant with "template cell" but if the AData-parameter of OnNeedCellData could even be a TCell structure (without the Row/Count) and you would have everything you'll need to fill the cell (including formatting options). In WriteWorksheet you loop through VirtualRowCount and VirtualColCount and receive this structure (from NeedCellData) (that's why Row/Count are not needed, they are forced by the caller) and it would be easy to build the styles-list. Although this style list would need to be in memory because otherwise you would need one style per cell. Keeping it in memory may not need as much space because duplicate styles will be reused (as it is now). The user would have to know what formatting to apply during the NeedCellData but i don't see that as a problem.

Maybe i'm seeing that completely wrong but if you want to keep formatting with woVirtualMode you'll have to sacrifice some memory (which would be much if it's the same for a lot of cells) and if you're not using formatting the styles_lists would be empty anyway.

Quote
I am seeing such errors also in BigChimp's unit tests which have a lot of temp file access without the zipper being involved. Therefore, I'd blame the fast consecutive deletion and recreation of temporary files for this issue instead. No idea on the mechanism, though.
Ok, so this is a bug in fpszipper.pp (it is a bug because this should be handled, maybe with adding delays when the file is inaccessible or switching to another file).
It's weird, though, that after i added a FZip.Filename := in xlsxooxml.pas the error went away.
(Maybe Windows doesn't like rapid creation and deletion of files without a name but only extension i.e. ".tmp" which is what happens if FFilename is not set)
« Last Edit: July 11, 2014, 05:26:53 pm by rvk »

wp

  • Hero Member
  • *****
  • Posts: 13433
Re: Using fpspreadsheet to write huge xlsx files
« Reply #11 on: July 11, 2014, 06:31:30 pm »
Quote
I'll take it these changes are going to be implemented for fpsopendocument.pas as well?
Surely, but ods is a bit crazy because there are no row/col indexes. The entire rectangle spanned from the first row/col to the last row/col has to be filled with an occupied cell or a placeholder telling that the cell is empty. Therefore, there are a few cases to consider and this is the reason why it is not done yet.

Quote
this would not be possible anymore
No. In virtual mode, there are no "cells" because they would occupy memory in the cell grid.

Quote
MyWorksheet.GetCell(3, 3)
You should not call GetCell because it creates a cell, i.e. occupies memory. The correct function would be "FindCell". But as I said, in virtual mode there it won't find anything.

Your idea with the format list: Certainly, but we don't have it - it is as simple as that. fpspreadsheet stores formats in the cells, not in a list (a temporary format list is created during reading and writing, but you don't know the format index of a given cell in the first place).  It is high priority on my to-do list, but it will break existing code (cell^.NumberFormat := ... won't work any more!). Before doing this step I want to complete fpspreadsheet within the existing framework, in particular bring ooxml to the same level of compleness of the other formats and maybe add some more formula-stuff. After that we can discuss how to continue - I'd vote for a major rewrite making cells to classes instead of records and trying to have least memory attached to cells. Probably this will be a different incompatible package (such as "fpspreadsheet_new").

Quote
if you want to keep formatting with woVirtualMode you'll have to sacrifice some memory
I don't think so - I'll have to find a place within the writing process where the formatting information is completely known and an event handler can be hooked in.

BTW - and very important: I think formatting in virtual mode should be is a constant format along each column, a different format for one or a few header rows would be acceptable. But is will not be possible to assign an arbitrary format to an arbitrary virtual cell. The basic idea is that of database export: You have a table with n fields and m records. A field corresponds to a column, a record to a row. All cells of the same field, i.e. along a given column, should have the same format. Except, maybe, for a header row.


rvk

  • Hero Member
  • *****
  • Posts: 6953
Re: Using fpspreadsheet to write huge xlsx files
« Reply #12 on: July 11, 2014, 07:25:25 pm »
Your idea with the format list: Certainly, but we don't have it - it is as simple as that. fpspreadsheet stores formats in the cells, not in a list (a temporary format list is created during reading and writing, but you don't know the format index of a given cell in the first place).
Yeah... i don't think i'm explaining it clear enough. (Is FFormattingStyles not a format-list at this point? Which you use to search for same formatting before assigning a style-index to a cell during writing?)

From the source:
Quote
{@@
  Checks if the formatting style of a cell is in the list of manually added
  FFormattingStyles and returns its index, or -1 if it isn't
  @param  AFormat  Cell containing the formatting styles which are seeked in the FFormattingStyles array.
}
function TsCustomSpreadWriter.FindFormattingInList(AFormat: PCell): Integer;       

Now with woVirtualMode your looping through VirtualRowCount and VirtualColCount so every cell is received (empty or not) via the NeedCellData. I'm just saying if we've got the possibility of also giving an AFormat (at that point) you could check if that style is already used (get the correct index from style-memory like FFormattingStyles is now). If so write that style-index together with the cell to the sheet-stream. If it is not used, store it in memory (i.e. FFormattingStyles) for later reference and use that new index for the cell. (The style.xml-stream could also be appended directly because we have the FColumnStyleList in memory).

That way every cell can have it's own formatting (if every cell has different formatting it would take memory but if lots of cells have the same formatting the memory requirement will be low, it's up to the user if he wants to apply lots of different styles).

That's just how i see it now... Don't worry if i'm explaining this not clear enough.
In that case i will see how the fpsopendocument.pas evolves and get back when i can be somewhat more concrete (with code-examples).

Quote
I'd vote for a major rewrite making cells to classes instead of records and trying to have least memory attached to cells.
Yeah... that was also my thought... Make the properties of the TCell a sort of TList or something so when properties aren't used they don't take any space.

Quote
Quote
if you want to keep formatting with woVirtualMode you'll have to sacrifice some memory
I don't think so - I'll have to find a place within the writing process where the formatting information is completely known and an event handler can be hooked in.
That would almost always mean two passes or like i suggested above collecting the formatting information in NeedCellData and storing the unique styles in memory (which should not take that much memory).

Quote
But is will not be possible to assign an arbitrary format to an arbitrary virtual cell.
... Except, maybe, for a header row.
That would be a shame because it would mean less control over the overall design of your sheets.
(and how about a footer row)

Not sure if you noticed my earlier post about the xlsxooxml.pas typo in <span> (so i include it here again):
2) I found that the files created with woVirtualMode are a few KB larger.
It seems that sheet1.xml are different for both methods.
I see in one sheet1.xml you do a <row r="19" spans="1:10000"> and in the other <row r="19" spans="1:100">
In xlsxooxml.pas you set it to Rowcount for woVirtualMode and to Colcount for [].
Typo ???  ;D
« Last Edit: July 11, 2014, 07:29:08 pm by rvk »

wp

  • Hero Member
  • *****
  • Posts: 13433
Re: Using fpspreadsheet to write huge xlsx files
« Reply #13 on: July 11, 2014, 11:22:37 pm »
Quote
Now with woVirtualMode your looping through VirtualRowCount and VirtualColCount so every cell is received (empty or not) via the NeedCellData. I'm just saying if we've got the possibility of also giving an AFormat (at that point) you could check if that style is already used (get the correct index from style-memory like FFormattingStyles is now). If so write that style-index together with the cell to the sheet-stream. If it is not used, store it in memory (i.e. FFormattingStyles) for later reference and use that new index for the cell. (The style.xml-stream could also be appended directly because we have the FColumnStyleList in memory).

Let me repeat this: you want to extend the OnNeedCellData by a Format parameter which is a cell:
Code: [Select]
type
  TNeedCellDataEvent = procedure(Sender: TObject; ARow,ACol:Cardinal; out AData: variant;
    out AFormat: TCell) of object;
Then in the event handler you pass data and format to the writer.

The problem is that the format must be found in the FormattingStyles array to get its index. If not found you say that it could be added to the array. But this is not possible because this styles list already has been written at this time. In OOXML, that's not such a severe problem because the styles are in a separate file which could be written later, but in BIFF it is severe since this is a binary file with many internal pointers. I definitely will not patch this file.

No, the way to go must be to define all formatting styles before writing and then hook into the writing process once the formatting styles have been collected by the writer's ListAllFormattingStyles. When this is completed an event must fire, call it "OnFormattingStylesCollected". In its handler you can seek the format templates and determine their style index. Store these indexes somewhere. Later, during writing the cells, the "OnNeedCellData" event is fired, and you decide on the basis of the provided row and column index which style index should be used. This makes sure that only existing formats are used.

Before implementing this, however, I'd prefer to interrupt this activity for the moment and add basic formatting to xlsooxml and activate virtual mode for ods first. Otherwise there would be only biff for testing, and I know that there are some subtle differences. So I have to ask you for your patience...

I did indeed miss your other posting of this afternoon. The "spans" issue was a typo as you suggested, it is fixed. There was another typo with a missing closing xml bracket which would have crashed writing of empty cells. I'll look at the filename issue of the zipper later.

@BigChimp, if you are reading this: I did not go into the zipper in detail so far. But the header notes of the fpszipper say that fpszipper should be replaced by the fpc zipper after FPC 2.7.1, probably already after 2.6.4. Should the version in the "$IF" be updated? And will the fpc zipper have the same issue that rvk reports?
 

rvk

  • Hero Member
  • *****
  • Posts: 6953
Re: Using fpspreadsheet to write huge xlsx files
« Reply #14 on: July 12, 2014, 12:09:11 am »
you want to extend the OnNeedCellData by a Format parameter which is a cell
....
If not found you say that it could be added to the array. But this is not possible because this styles list already has been written at this time.
Yep. That's how i saw the call to OnNeedCellData .
But i didn't mean "added to the array" (because that stream is already written) but (also) added to a small new memory-structure with unique styles. For Opendocument and OOXML this shouldn't be a problem but i'm not very familiar with the biff-format so i can't know if that a problem there.

The way you described (with first collecting the styles) would be basically the same (for .xlsx and .ods) because your looping though the cells row by column so all styles would be gathered sequentially and are directly indexed in memory for later and current cells. (But i can't speak for the biff-format). I'll wait patiently on your implementation and see how it compares to my vision ;)

Don't worry about the time-frame. It's not a critical point at my end.
(so take your time which you should always do)

 

TinyPortal © 2005-2018