Author Topic: larger Tstringlist Capacity  (Read 31086 times)

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
larger Tstringlist Capacity
« on: May 24, 2010, 07:42:06 pm »
OS: WinXP
Lazarus 0.9.28.2
FPC 2.2.4

I'm working on a project that converts very large data sets into CSV files. The core of my code uses a TStringList to hold the data until final output. At this time, I need to hold the whole file because I do some math on its elements. The results are saved on the first line of the final CSV file, along with a possible adjustment to every line after it.

I think I have found the limit of the TStringList, as I get "out of memory" errors {203} on the big data sets. I looked at the Capacity property and it looks like it is only used to speed things up.

Web searches also imply there is a 32K limit with Delphi, but I'm well over that number, with Task Manager saying I'm using a max of 689 MB. This number drops down to 10 MB before and after the TStringList is created and freed.

Is there another object, perhaps in the data access arena, that I can replace the TStringList with?
or
Is there a way to increase the TStringList capacity?

PS: I found a way around most of my problems by switching to writeln(file) for some of the early processing, but the TStrings.LoadFromFile code crashes with files over a third of a gigabyte.
Thank you for your time.
« Last Edit: May 24, 2010, 08:32:19 pm by mas steindorff »
windows 10 &11, Ubuntu 21+ IDE 3.4 general releases

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #1 on: May 24, 2010, 09:42:38 pm »
What do you intend to do in this file after loading it in the TStringList?

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
Re: larger Tstringlist Capacity
« Reply #2 on: May 24, 2010, 10:11:26 pm »
The final goal is a CSV text file that is fed into another third-party program. It runs one line at a time, so file size is not an issue for it.

What do you intend to do in this file after loading it in the TStringList?

As one of my processing steps, I save the translated binary data into a simpler CSV file that basically has raw data values for the different channels. After I reload this file, I calculate each channel's offset/bias and then remove this value from each reading/line. The simpler file is generated with writeln() because of an append operation I do. This is how the file gets larger than LoadFromFile can handle.

New info: I replaced LoadFromFile and SaveToFile with my own code, and so far the rest is working. I checked the limit of the control variables (Integer) and I'm far from overrunning them with only 1,845K lines, so I think I may have a fix. I looked at TStrings.LoadFromFile / LoadFromStream and could not see any obvious errors, so I may have just put the error off by a few KB. :-\

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #3 on: May 24, 2010, 10:41:17 pm »
Do you actually need TStringList? Why don't you do something like this:

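Code:
function readfileline(var filename:textfile; line :integer):string;
var
  n :integer;
begin
  // Returns line number "line" (zero-based) of an already assigned text file,
  // or the last line if the file is shorter; '' if the file is empty.
  Result := '';
  n := 0;
  Reset(filename);
  try
    while (n <= line)and not eof(filename) do
    begin
      Readln(filename, Result);
      inc(n);
    end;
  finally
    system.close(filename);
  end;
end;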

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #4 on: May 24, 2010, 10:42:34 pm »
Code:
procedure TForm1.Button1Click(Sender:TObject);
var
  filename :textfile;
begin
  System.Assign(filename, 'b.txt');
  showmessage(readfileline(filename, 10));
end;

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
Re: larger Tstringlist Capacity
« Reply #5 on: May 24, 2010, 11:04:52 pm »
I'm using the TStringList for its storage. That way I can scan it and fix it without getting too involved with other large arrays of different data types. Here is a snippet of my code. TStrFile may hold over 1 million lines at this point.

Code:
procedure TFMain.PostProcessCSVfile(var TStrFile:TStringList);
const CountToV_mul = 0.319671630859375;
var i,count: Cardinal;
    ch:integer;
    offset:array[1..12] of extended;  // 10 byte float
    TimeOffset, tmp:extended;
    str, off_str:string;
    Tstr: TStringList;
begin
   FillChar(offset,sizeof(offset),0);
   count:=0;
   // scan the rest to update bias and offset numbers
   i := 240*60; // skip first minute of data
   if (i > TStrFile.Count-1) then
      i:= 1; // skip the header
   Tstr := TStringList.Create;
   try
   // scan the file and calculate the offsets
      while (i < TStrFile.Count) do begin
         str:= TstrFile.Strings[i];
         tstr.CommaText:= str;
         inc(count);
         for ch:=1 to 12 do begin
            tmp := StrToInt(tstr.Strings[ch+5]); // index 6 = 7th col = chan1
            offset[ch] := offset[ch]+tmp;
            end;
         inc(i);  // next line
      end; // end of while
      if (count > 1) then begin
         for ch:=1 to 12 do begin
             offset[ch] := offset[ch]/count;
             end;
         end;
      off_str := format(';,,,,,offset=,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,%.3f,mul=,%.4f',
            [offset[1],offset[2],offset[3],offset[ 4],offset[ 5],offset[ 6],
             offset[7],offset[8],offset[9],offset[10],offset[11],offset[12],CountToV_mul]);
      TStrFile.Insert(0,off_str);
      // update the header
      str := TStrFile.Strings[1]; // get the header
      str := str+'seconds,uV1,uV2,uV3,uV4,uV5,uV6,uV7,uV8,uV9,uV10,uV11,uV12';
      TStrFile.Strings[1] := str; // save new header
   // save the time of the first record so it can be removed later
      str:=TstrFile.Strings[2];
      tstr.CommaText:= str;
      TimeOffset := StrToInt(tstr.Strings[1]); // second col = time

   // convert each ADC -> voltage (adc-offset)*mul
      for i:=2 to TStrFile.Count-1 do begin
         str:= TstrFile.Strings[i];
         tstr.CommaText:= str;
         // adjust the time
         tmp := StrToInt(tstr.Strings[1]); // 2nd = time
         tmp := (tmp -TimeOffset) *1.000/1024.0;  // ms ->sec
         tstr.Add(format('%-.4f',[tmp]));  // add to the end of this line
         // now for the channel data
         for ch:=1 to 12 do begin
             str := tstr.Strings[ch+5]; // 7th col = chan1 data
             if str <>'' then begin
                tmp := StrToInt(str);
                tmp := tmp - offset[ch];
                tmp := tmp * CountToV_mul;
                tstr.Add(format('%.3f',[tmp]));
                end;
             end;
         str := tstr.CommaText;
         TstrFile.Strings[i] := str;
      end;
   // ...
   finally
     Tstr.Free;
   end;

I know things would run faster if I did not do the string conversion, but this software is already 1200% faster than the original, and it did not do the post-processing.



Troodon

  • Sr. Member
  • ****
  • Posts: 484
Re: larger Tstringlist Capacity
« Reply #6 on: May 25, 2010, 08:22:18 am »
You probably need to design your application differently. If your data storage need is in the hundreds of megs, how do you know the target machine will have enough RAM for it? You cannot assume that. That is an issue modern text editors had to deal with; that is why Ultraedit/BBedit and Notepad++ work where Word and Notepad don't. You need your algorithms to work with a data segment, a "window" of data that you can safely store and process in the application memory at any time on any machine.
Lazarus/FPC on Linux

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
Re: larger Tstringlist Capacity
« Reply #7 on: May 25, 2010, 04:04:44 pm »
You probably need to design your application differently.

Yes, you are right. I was just checking to see if there was already a database object I could use as a replacement for the TStringList. I have not used any of them before.

...If your data storage need is in the hundreds of megs, how do you know the target machine will have enough RAM for it? You cannot assume that.
Windows at least helps me out there. It will swap the RAM to its page file when I run out of the real stuff. It's slower than what I could do myself, but processing 3 days of data in under 20 minutes has already beaten the 2 hr run of the original Python code.

That is an issue modern text editors had to deal with; that is why Ultraedit/BBedit and Notepad++ work where Word and Notepad don't. You need your algorithms to work with a data segment, a "window" of data that you can safely store and process in the application memory at any time on any machine.

I would not be so sure about Notepad++. It started to fail before my quick program did. There is a limit on the number of lines it will show as well. I had to use XVI32 (a hex editor that uses segments correctly) to look at my bigger files.

I have just completed testing my new code on a data set that was 2x larger than the one that breaks TStringList.LoadFromFile and SaveToFile. It will take me a little time to verify the results, but I did not get an "out of memory" error with 3,641,307 lines in the TStringList, so I may have a quick fix until I can implement a "windowed" approach.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 12770
  • FPC developer.
Re: larger Tstringlist Capacity
« Reply #8 on: May 25, 2010, 04:11:48 pm »
Forget about Tstringlist for very large files.

1. It reads the entire file into some var, and only then splits it out (requiring twice the memory)
2. it reallocs the buffer it is reading into several times, seriously fragmenting the heap.

Just read it line by line using textfile.
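The line-by-line approach fits the job described earlier in the thread (average a channel, then subtract that average from every reading) if the file is read twice. What follows is only a minimal sketch under assumptions: the names ColumnMean and SubtractMean are made up for illustration, it handles a single zero-based column where the posted code handles twelve, and it assumes a one-line header and integer readings.

```pascal
program TwoPassOffset;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Pass 1: stream the file once and return the mean of one numeric column.
// Col is zero-based; the first line is treated as a header and skipped.
function ColumnMean(const InName: string; Col: Integer): Double;
var
  F: TextFile;
  Line: string;
  Cols: TStringList;
  Sum: Double;
  N: Int64;
begin
  Sum := 0;
  N := 0;
  Cols := TStringList.Create;
  try
    AssignFile(F, InName);
    Reset(F);
    try
      if not Eof(F) then
        ReadLn(F, Line);                 // skip the header
      while not Eof(F) do
      begin
        ReadLn(F, Line);
        Cols.CommaText := Line;
        if Col < Cols.Count then         // ignore short or blank lines
        begin
          Sum := Sum + StrToInt(Cols[Col]);
          Inc(N);
        end;
      end;
    finally
      CloseFile(F);
    end;
  finally
    Cols.Free;
  end;
  if N > 0 then
    Result := Sum / N
  else
    Result := 0;
end;

// Pass 2: stream the file again, append (value - Mean) to every data line,
// and write each line straight to the output so only one line is held in memory.
procedure SubtractMean(const InName, OutName: string; Col: Integer; Mean: Double);
var
  Src, Dst: TextFile;
  Line: string;
  Cols: TStringList;
  FS: TFormatSettings;
begin
  FS := DefaultFormatSettings;
  FS.DecimalSeparator := '.';            // keep the CSV output locale-independent
  Cols := TStringList.Create;
  try
    AssignFile(Src, InName);
    Reset(Src);
    try
      AssignFile(Dst, OutName);
      Rewrite(Dst);
      try
        if not Eof(Src) then
        begin
          ReadLn(Src, Line);
          WriteLn(Dst, Line);            // copy the header unchanged
        end;
        while not Eof(Src) do
        begin
          ReadLn(Src, Line);
          Cols.CommaText := Line;
          if Col < Cols.Count then
            Line := Line + ',' + Format('%.3f', [StrToInt(Cols[Col]) - Mean], FS);
          WriteLn(Dst, Line);
        end;
      finally
        CloseFile(Dst);
      end;
    finally
      CloseFile(Src);
    end;
  finally
    Cols.Free;
  end;
end;

var
  Mean: Double;
begin
  if ParamCount <> 3 then
  begin
    WriteLn('usage: TwoPassOffset <in.csv> <out.csv> <column>');
    Halt(1);
  end;
  Mean := ColumnMean(ParamStr(1), StrToInt(ParamStr(3)));
  SubtractMean(ParamStr(1), ParamStr(2), StrToInt(ParamStr(3)), Mean);
end.
```

The cost is reading the input twice, but memory use no longer depends on the file size, which is the trade-off the "window" suggestion earlier in the thread is aiming for.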

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #9 on: May 25, 2010, 07:13:22 pm »
An alternative solution is TFileStream.

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
Re: larger Tstringlist Capacity
« Reply #10 on: May 25, 2010, 07:35:47 pm »
An alternative solution is TFileStream.

Is there any speed improvement over just using the basic textfile access commands?
From what I know (not much any more), a stream reads N bytes, whereas ReadLn(F,str) will grab one line at a time.

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #11 on: May 25, 2010, 11:47:15 pm »
If every line in the file had the same length, for example, you could edit it more easily in place, whether through a stream or a textfile.
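With equal-length lines, the byte position of any line is just its index times the record length, so a single line can be read without scanning the file. A small sketch, assuming each line occupies exactly RecLen bytes including its line terminator; LineOffset and ReadFixedLine are illustrative names, not library routines.

```pascal
program FixedLenSeek;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

// Byte offset of a zero-based line when every line occupies RecLen bytes,
// line terminator included.
function LineOffset(Index: Int64; RecLen: Integer): Int64;
begin
  Result := Index * RecLen;
end;

// Read one fixed-length line without touching the rest of the file.
// Returns '' when the requested line lies past the end of the file.
function ReadFixedLine(const FileName: string; Index: Int64; RecLen: Integer): string;
var
  FS: TFileStream;
  Got: Integer;
begin
  Result := '';
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    if LineOffset(Index, RecLen) >= FS.Size then
      Exit;
    FS.Position := LineOffset(Index, RecLen);
    SetLength(Result, RecLen);
    Got := FS.Read(Result[1], RecLen);
    SetLength(Result, Got);
    Result := TrimRight(Result);     // strip the CR/LF and any trailing padding
  finally
    FS.Free;
  end;
end;

begin
  if ParamCount <> 3 then
  begin
    WriteLn('usage: FixedLenSeek <file> <line> <reclen>');
    Halt(1);
  end;
  WriteLn(ReadFixedLine(ParamStr(1), StrToInt(ParamStr(2)), StrToInt(ParamStr(3))));
end.
```

Overwriting a line in place works the same way: seek to LineOffset and write exactly RecLen bytes.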

Chronos

  • Sr. Member
  • ****
  • Posts: 256
    • PascalClassLibrary
Re: larger Tstringlist Capacity
« Reply #12 on: May 26, 2010, 12:51:46 am »
If you want to handle huge files, you need to work with a smaller part of the file at a time. Loading the whole file into memory is probably the easiest way, but there are many others. Text viewers and editors have to calculate the position of every line and allow fast browsing through the file. The file can be read whole into memory, or the program can step through lines relative to the current position, or it can analyze the file on opening and store only the start of each line in memory, which needs much less memory, etc. Depending on the situation, though, it is better to do a single pass, or two passes where the first pass gathers some additional information and the second pass does the main work. The Pascal compiler works in a similar way, as multi-pass, maybe two-pass, processing of source code.
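The "store the start of every line" idea can be sketched as follows. This is an illustration under assumptions rather than anything taken from an existing library: BuildLineIndex and ReadIndexedLine are made-up names, lines are assumed to end in LF or CR+LF, and the index costs 8 bytes per line instead of the full line text.

```pascal
program LineIndexDemo;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

type
  TOffsetArray = array of Int64;

// Scan the file in fixed-size chunks and record the byte offset where each line starts.
function BuildLineIndex(const FileName: string): TOffsetArray;
const
  ChunkSize = 64 * 1024;
var
  FS: TFileStream;
  Buf: array[0..ChunkSize - 1] of Byte;
  Got, I, Count: Integer;
  Base, Size: Int64;
begin
  Result := nil;
  Count := 0;
  Base := 0;
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    Size := FS.Size;
    if Size > 0 then
    begin
      SetLength(Result, 1024);
      Result[0] := 0;                  // the first line starts at offset 0
      Count := 1;
    end;
    repeat
      Got := FS.Read(Buf, ChunkSize);
      for I := 0 to Got - 1 do
        // a new line starts after every LF, unless the LF is the last byte
        if (Buf[I] = 10) and (Base + I + 1 < Size) then
        begin
          if Count = Length(Result) then
            SetLength(Result, Count * 2);
          Result[Count] := Base + I + 1;
          Inc(Count);
        end;
      Inc(Base, Got);
    until Got = 0;
  finally
    FS.Free;
  end;
  SetLength(Result, Count);
end;

// Fetch one line (zero-based) using the index; '' if Line is out of range.
function ReadIndexedLine(const FileName: string; const Index: TOffsetArray;
  Line: Integer): string;
var
  FS: TFileStream;
  Len: Integer;
begin
  Result := '';
  if (Line < 0) or (Line > High(Index)) then
    Exit;
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
  try
    if Line < High(Index) then
      Len := Integer(Index[Line + 1] - Index[Line])
    else
      Len := Integer(FS.Size - Index[Line]);
    if Len <= 0 then
      Exit;
    FS.Position := Index[Line];
    SetLength(Result, Len);
    FS.ReadBuffer(Result[1], Len);
    // drop the trailing CR/LF but keep any other trailing characters
    while (Result <> '') and (Result[Length(Result)] in [#10, #13]) do
      SetLength(Result, Length(Result) - 1);
  finally
    FS.Free;
  end;
end;

var
  Idx: TOffsetArray;
begin
  if ParamCount <> 2 then
  begin
    WriteLn('usage: LineIndexDemo <file> <line>');
    Halt(1);
  end;
  Idx := BuildLineIndex(ParamStr(1));
  WriteLn(Length(Idx), ' lines');
  WriteLn(ReadIndexedLine(ParamStr(1), Idx, StrToInt(ParamStr(2))));
end.
```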

From what I know (not much any more), a stream reads N bytes, whereas ReadLn(F,str) will grab one line at a time.

But you can create your own extended version of TFileStream, add ReadLn and WriteLn functionality, and adjust it for your purpose. TFileStream is definitely choice number one for object-style access to binary files.


Here is a simple example:

Code:
unit UTextFileStream;

{$mode Delphi}{$H+}

interface

uses Classes, SysUtils;

type

  TTextFileStream = class(TFileStream)
  private
    FBuffer: string;
  public
    function Eof: Boolean;
    procedure WriteLn(Text: string);
    function ReadLn: string;
    function Seek(const Offset: Int64; Origin: TSeekOrigin): Int64; override;
    function RowsCount: Integer;
  end;

implementation

{ TTextFileStream }

function TTextFileStream.Eof: Boolean;
begin
  Eof := ((Position - Length(FBuffer)) = Size);
end;

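function TTextFileStream.ReadLn: string;
const
  BufferLength = 10000;
var
  NewBuffer: string;
  Readed: Integer;
  P: Integer;
begin
  Readed := 1;
  // Refill until the buffer holds a complete CR+LF terminator or the file ends.
  // Searching for the pair (not just #13) avoids losing a character when a
  // chunk boundary falls between the CR and the LF.
  while (Pos(#13#10, FBuffer) = 0) and (Readed > 0) do begin
    SetLength(NewBuffer, BufferLength);
    Readed := Read(NewBuffer[1], BufferLength);
    SetLength(NewBuffer, Readed);
    FBuffer := FBuffer + NewBuffer;
  end;
  P := Pos(#13#10, FBuffer);
  if P > 0 then begin
    Result := Copy(FBuffer, 1, P - 1);
    Delete(FBuffer, 1, P + 1);  // drop the line together with its CR+LF
  end else begin
    Result := FBuffer;  // last line, no terminator
    FBuffer := '';
  end;
end;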

function TTextFileStream.RowsCount: Integer;
begin
  Result := 0;  // start at zero so the result equals the number of lines read
  FBuffer := '';
  Seek(0, soBeginning);
  while not Eof do begin
    ReadLn;
    Inc(Result);
  end;
  Seek(0, soBeginning);
end;

function TTextFileStream.Seek(const Offset: Int64;
  Origin: TSeekOrigin): Int64;
begin
  if Origin = soCurrent then
    Result := inherited Seek(Offset - Length(FBuffer), Origin)
    else Result := inherited Seek(Offset, Origin);
  FBuffer := '';
end;

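procedure TTextFileStream.WriteLn(Text: string);
const
  NewLine: string = #13#10;
begin
  Seek(0, soCurrent);  // discard read-ahead so writing starts at the logical position
  if Length(Text) > 0 then
    Write(Text[1], Length(Text));  // guard: Text[1] is not valid for an empty string
  Write(NewLine[1], Length(NewLine));
end;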

end.

mas steindorff

  • Hero Member
  • *****
  • Posts: 571
Re: larger Tstringlist Capacity
« Reply #13 on: May 26, 2010, 02:12:12 pm »
I know we are a little off topic here, but my curiosity has flared up  :'(
The front end of my program pulls binary data from different files using the (in my day) standard file calls
  Fhandle := FileOpen(FileName, fmOpenRead+ fmShareDenyNone);
  FileSeek(Fhandle, 0, fsFromBeginning)
 and
  x := FileRead( Fhandle, myFileBlock, sizeof(myFileBlock));
  
These commands were great and allowed me to jump around and decode the binary files with surprisingly little code.

Why is a TFileStream any better than these calls? Does TFileStream provide some additional speed improvements or cross-platform compatibility?

Sometime in the near future I plan on working with sockets in order to pass information to and from other programs, and I suspect TStream or a descendant of it will be involved. Are people more inclined to use TFileStream because of its compatibility or commonality with objects that pull data from non-file sources?

PS: how do I mark this thread as solved?
« Last Edit: May 26, 2010, 02:14:34 pm by mas steindorff »

typo

  • Hero Member
  • *****
  • Posts: 3051
Re: larger Tstringlist Capacity
« Reply #14 on: May 26, 2010, 03:04:13 pm »
TFileStream saves you from having to do many of the I/O operations yourself.

 
