
Author Topic: [SOLVED] Most efficient ways to save data to file - using TMemoryStream?  (Read 983 times)

otoien

  • Jr. Member
  • **
  • Posts: 89
I am working on an application that translates data from one file format to another. The read part is very efficient. I allocate memory for my arrays in big chunks at a time with SetLength, resulting in a read from a file stream plus some processing of a 2.4GB file in about 18sec from an internal 2TB Intel M2 SSD (so somewhat slow, but faster than SATA SSD).

The write part, from my generated array to files, takes a lot more time: about 12 minutes in total. In Task Manager I can see memory use for the MemoryStream slowly creeping up during the progress of each file; it drops after a file is written and then starts creeping up again. (I am splitting the original array into 4 files, so the procedures below are run multiple times.) If my understanding of MemoryStream is correct, the code below will first fill the stream and then write the file in one go (supported by the observation that writing directly to a file stream is much slower). It therefore seems that there is a lot of overhead in the piece-by-piece memory adjustment of the MemoryStream. While I can live with this, is there a way to speed this up, to make memory adjustments in bigger jumps (as I do during the read)? Should I consider alternatives to MemoryStream?

I tried adding the SetSize call below (placed after the kmax := RecSamples; line), but I still see the same gradual memory increase, not a jump.
Code: Pascal
  MSize := (2 * imax * jmax * kmax) + mStream.Position;
  mStream.SetSize(MSize);
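For what it's worth, here is a minimal sketch of preallocating the stream in a single jump before the write loops start (it assumes imax, jmax and kmax are already known at that point, as in the procedure below):

```pascal
// Sketch only: preallocate the TMemoryStream once, up front, so its buffer
// is not regrown piecemeal during the write loops.
var
  TotalBytes: Int64;
begin
  TotalBytes := Int64(2) * imax * jmax * kmax;     // 2 bytes per SmallInt sample
  mStream.SetSize(mStream.Position + TotalBytes);  // one allocation up front
  // Subsequent mStream.Write calls advance Position without reallocating,
  // as long as the writes stay within the preallocated size.
end;
```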

(I am of course aware that the write speed of an SSD is much slower than its read speed, but I find the difference above disproportionate, and the slow memory increase indicates that the main bottleneck is not the SSD speed.)

Here is the code snippet in question:
Code: Pascal
procedure WriteCAREDFDataRecords(const ix: integer; const subj: integer;
  mStream: TMemoryStream);
var
  m, i, j, k, CARBuffVar: Integer;
  rawValue: SmallInt;
begin
  m := 0;                            // index for buffered CAR data
  imax := CAREDFdoc.iNumOfDataRecs;
  jmax := CAREDFdoc.iNumOfSignals;   // number of signals
  kmax := RecSamples;                // total number of samples per record
  for i := 0 to imax - 1 do
  begin
    for j := 0 to jmax - 1 do
    begin
      CARBuffVar := SubjVarIdx[CarSubj, j + 1] - 1;
      for k := 0 to kmax - 1 do
      begin
        if (CAREDFdoc.iNumOfSamples[j] > 0) and (m + k <= ix) then
        begin
          rawValue := RawBuffArr[(m + k), CARBuffVar];
          mStream.Write(NToLE(rawValue), 2);
        end;
      end;
    end;
    m := m + kmax;
  end;
end;

procedure WriteCAREDFStream(const ix: integer; const subj: integer;
  aStream: TStream);
//const aBaseURI: ansistring);
var
  mStream: TMemoryStream;
  Stat: Integer;
begin
  mStream := TMemoryStream.Create;
  if Assigned(CAREDFDoc) then
    try
      CAREDFDoc.WriteHeaderToStream(mStream);
      if CAREDFDoc.StatusCode = noErr then
      begin
        WriteCAREDFDataRecords(ix, subj, mStream);
        mStream.Position := 0;
        aStream.CopyFrom(mStream, mStream.Size);
      end;
    except
      CAREDFDoc.StatusCode := saveErr;
    end;
  mStream.Free;
end;
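One way to reduce the per-sample overhead in the inner loop above would be to buffer each record's samples into a SmallInt array and issue a single Write per record. This is only a sketch; Buf and n are illustrative names, and it reuses kmax, m, ix, CarBuffVar and RawBuffArr from the snippet above:

```pascal
// Sketch only: one mStream.Write per record instead of one 2-byte Write
// per sample.
var
  Buf: array of SmallInt;
  k, n: Integer;
begin
  SetLength(Buf, kmax);
  n := 0;
  for k := 0 to kmax - 1 do
    if m + k <= ix then
    begin
      Buf[n] := NToLE(RawBuffArr[m + k, CarBuffVar]); // convert endianness up front
      Inc(n);
    end;
  if n > 0 then
    mStream.Write(Buf[0], n * SizeOf(SmallInt));      // one call per record
end;
```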

I am currently using Lazarus 2.0.8 64-bit on Windows 10.
« Last Edit: May 30, 2020, 05:14:36 am by otoien »
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64

Handoko

  • Hero Member
  • *****
  • Posts: 5149
  • My goal: build my own game engine using Lazarus
Re: Most efficient ways to save data to file - using TMemoryStream?
« Reply #1 on: May 29, 2020, 01:48:48 pm »
I haven't done any benchmarks, but I think BlockWrite is the fastest and most memory-efficient. Maybe you can test it.
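For illustration, a minimal BlockWrite sketch on an untyped file (the file name and buffer contents are hypothetical, not from the original code):

```pascal
// Sketch only: BlockWrite on an untyped file with a 1-byte record size,
// writing a whole SmallInt buffer in one call.
var
  f: file;
  Buf: array of SmallInt;
  Written: Integer;
begin
  SetLength(Buf, 1024);
  // ... fill Buf with samples ...
  AssignFile(f, 'out.dat');       // hypothetical output file
  Rewrite(f, 1);                  // record size = 1 byte
  BlockWrite(f, Buf[0], Length(Buf) * SizeOf(SmallInt), Written);
  CloseFile(f);
end;
```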

jamie

  • Hero Member
  • *****
  • Posts: 6128
Re: Most efficient ways to save data to file - using TMemoryStream?
« Reply #2 on: May 29, 2020, 06:31:36 pm »
Looks like you are doing some Fourier transforms, going from samples to complex numbers and maybe back..

 That's really putting the hardware to work!  :)

Why not translate on the fly instead of writing it all back to file in IDFT or DFT etc.. ?
The only true wisdom is knowing you know nothing

otoien

  • Jr. Member
  • **
  • Posts: 89
Re: Most efficient ways to save data to file - using TMemoryStream?
« Reply #3 on: May 30, 2020, 05:14:15 am »
Looks like you are doing some Fourier transforms, going from samples to complex numbers and maybe back..
 That's really putting the hardware to work!  :)
Why not translate on the fly instead of writing it all back to file in IDFT or DFT etc.. ?

Jamie, no Fourier transforms are involved here, just splitting and re-sorting the data into files of a different format. We have lots of data recorded with a commercial data acquisition program called CARecorder; the company went bankrupt, and its analysis software was inadequate and can no longer be installed. I figured out the file format and I am converting to the European Data Format (EDF), using the PUMA EDFplus library of Johannes Dietrich to write headings etc. conforming to the EDF standard. Most of the numbers are kept in the original SmallInt format; only a few of the variables have some calculations done to them. The initial read into the two-dimensional RawBuffArr, including those calculations and without attempting any output, is reasonably fast (about 18 sec for a 2.4 GB file). When tested, a direct write of those data to a FileStream took 5 times as long as writing to a TMemoryStream (which takes about 12 min), so that is not an alternative.
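If the direct FileStream write is slow because of many small writes, one option worth testing is wrapping the file stream in FPC's TWriteBufStream from the bufstream unit. This is a sketch, not code from the project; the file name and buffer size are assumptions:

```pascal
// Sketch only: buffer small stream writes so the underlying TFileStream
// receives large chunks.
uses
  Classes, bufstream;

var
  fs: TFileStream;
  bs: TWriteBufStream;
begin
  fs := TFileStream.Create('out.edf', fmCreate);   // hypothetical output file
  try
    bs := TWriteBufStream.Create(fs, 1 shl 20);    // 1 MiB buffer (assumed size)
    try
      // ... many small bs.Write(...) calls here, flushed in large chunks ...
    finally
      bs.Free;                                     // flushes remaining data
    end;
  finally
    fs.Free;
  end;
end;
```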

I haven't done any benchmark. But I think BlockWrite is the fastest and most memory efficient. Maybe you can test it.

Thanks Handoko, I took a look into that. However, I realized it is best to keep using streams, as the EDF library depends on them to write the header.

As has happened on earlier occasions when I have asked for help, your remark made me start thinking about alternatives and re-examine the bottlenecks. The timing really did not make sense, and I found that it was not the MemoryStream at all: when all stream writes were commented out, it took almost as long. Then I realized that the statement
Code: Pascal
  if (CAREDFdoc.iNumOfSamples[j] > 0)

actually refers to a function in the EDF class that converts from a string (the "number" format of the EDF header) to integer values, which was not so wise to call inside that inner loop! Without that statement, I am now translating a 2.4 GB file into four 674.5 MB EDF files in 40 seconds. So the problem is solved.
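The fix described above amounts to hoisting the invariant call out of the innermost loop. A sketch, using the names from the earlier snippet:

```pascal
// Sketch only: evaluate the per-signal sample count once per signal,
// instead of calling CAREDFdoc.iNumOfSamples[j] (which parses a header
// string) once per sample.
var
  SamplesJ: Integer;
begin
  for j := 0 to jmax - 1 do
  begin
    SamplesJ := CAREDFdoc.iNumOfSamples[j];  // string-to-integer, once per signal
    if SamplesJ > 0 then
      for k := 0 to kmax - 1 do
        if m + k <= ix then
          ; // write the sample as before
  end;
end;
```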
Unless otherwise noted I always use the latest stable version of Lazarus/FPC x86_64-win64-win32/win64
