Author Topic: Need advice from pros - processing large CSV files  (Read 12765 times)

Fantablup

  • Full Member
  • ***
  • Posts: 171
Re: Need advice from pros - processing large CSV files
« Reply #15 on: July 10, 2021, 08:40:55 pm »
Hehe  :)

Have a look at TBufStream.
You can read line by line, byte by byte, or character by character.

That is the way I would go.

I will actually use TBufStream for my own projects; I am using TStream right now.
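A minimal sketch of the idea (untested; the file name is made up, and I'm assuming the capacity overload of TReadBufStream from FPC's bufstream unit):

```pascal
uses
  Classes, SysUtils, bufstream;

var
  FS: TFileStream;
  RBS: TReadBufStream;
  B: Byte;
  Count: Int64;
begin
  FS := TFileStream.Create('data.csv', fmOpenRead or fmShareDenyWrite);
  try
    // Second argument is the buffer capacity in bytes.
    RBS := TReadBufStream.Create(FS, 64 * 1024);
    try
      Count := 0;
      // Reading one byte at a time is cheap here, because each Read
      // is served from the 64 KB buffer rather than from the OS.
      while RBS.Read(B, 1) = 1 do
        Inc(Count);
      WriteLn('Bytes read: ', Count);
    finally
      RBS.Free;
    end;
  finally
    FS.Free;
  end;
end.
```

The point is that the buffered wrapper turns many tiny reads into a few large ones, which is where most of the speed comes from.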

Tony Stone

  • Sr. Member
  • ****
  • Posts: 279
Re: Need advice from pros - processing large CSV files
« Reply #16 on: July 10, 2021, 08:42:00 pm »
Quote from: alpine
@Tony Stone,
Would you please share some more details?

Perhaps:
How "huge" are those CSV files? 10 MB, 100 MB, gigabytes, terabytes?
What is your target OS, and how much RAM do you have?
What is the reason for involving threads?
Does the processing involve relations between different parts of the CSV file, i.e. are the cells somehow related, and how widely dispersed are they?
Are the files immutable during processing or not?

Files so far range from 500MB to 3GB or more; future files could reach 10GB.
Windows and Linux; Mac would be cool to target as well.
I am using threads so that it will hopefully be faster, and I want my user to be able to keep using the GUI while the parsing happens in the background.
Not sure what you mean by relations within CSV files... I really just need to pull one line at a time from the source CSV and write one of its columns to a new text file.
Immutable during processing?  Well, they won't be read or written by any other programs... they are static.

alpine

  • Hero Member
  • *****
  • Posts: 1412
Re: Need advice from pros - processing large CSV files
« Reply #17 on: July 10, 2021, 09:22:16 pm »
Quote from: Tony Stone
So the user may add several files.  Some of the files I am testing range from 500MB to 3GB.  Some of the files have multiple "columns".  My user will be interested in saving only one particular column to the new "target" file, which will more than likely be the first column of the source CSV files.  Yes, these will be static files.  Users will have downloaded them from other sources... They just need to be parsed into a simpler format... basically take the items of interest and put them into a new file...  I have tried TStringGrid.LoadFromFile and it is painfully slow. lol  My users will probably be on Windows or Linux.  I have been focusing on making my application work in both Windows and Linux... I have no Mac to test with, sadly.

I will also look into csvreadwrite as someone else posted.  Thank you!  And to answer that person, I just need to read one line at a time from the file... decide what to do with it (probably half will be written to the new file)... then move on to the next source file and do the same thing...

So it's pretty straightforward then ... As wp suggested, it is good to use TCSVParser, since it parses the file on the fly without loading it into memory. Then you can use TReadBufStream, as Fantablup also suggested, for speeding up the I/O. Same for the output: use TCSVBuilder and TWriteBufStream.

The snippet should look like this (not compiled nor tested):
Code: Pascal
uses
  Classes, SysUtils, csvreadwrite, bufstream;

procedure CSVProc(AFilename, ATargetFilename: String; AColumnOfInterest: Integer);
var
  FS, TS: TFileStream;
  RBS: TReadBufStream;
  WBS: TWriteBufStream;
  Parser: TCSVParser;
  Builder: TCSVBuilder;
begin
  FS := TFileStream.Create(AFilename, fmOpenRead or fmShareDenyWrite);
  try
    RBS := TReadBufStream.Create(FS);
    TS := TFileStream.Create(ATargetFilename, fmCreate);
    try
      WBS := TWriteBufStream.Create(TS);
      Builder := TCSVBuilder.Create;
      try
        Builder.SetOutput(WBS);
        Parser := TCSVParser.Create;
        try
          Parser.SetSource(RBS);
          while Parser.ParseNextCell do
          begin
            // Actual processing goes here ...
            if Parser.CurrentCol = AColumnOfInterest then
            begin
              Builder.AppendCell(Parser.CurrentCellText);
              Builder.AppendRow; // one output row per extracted cell
            end;
          end;
        finally
          Parser.Free;
        end;
      finally
        Builder.Free;
        WBS.Free;
      end;
    finally
      TS.Free;
      RBS.Free;
    end;
  finally
    FS.Free;
  end;
end;

I believe you can write the actual processing logic at the marked lines.
IMO that is all you need for processing a single CSV file; it won't use much memory and will run at a decent speed, since the I/O is buffered.

Regards,
« Last Edit: July 10, 2021, 09:34:37 pm by y.ivanov »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

Tony Stone

  • Sr. Member
  • ****
  • Posts: 279
Re: Need advice from pros - processing large CSV files
« Reply #18 on: July 11, 2021, 12:11:00 am »

So this looks very useful to me!  Thank you!  Now, I was under the impression I could read a file line by line with just a TFileStream?  Do I need to use the read-buffer and write-buffer objects?  Wouldn't that just be loading part of the file into memory and then parsing it?  If you could clarify that for me a little, that would be great, and it leads to my final question.

Using a TFileStream (with or without a buffer), how would I go about reporting progress back to my main thread?  So far I have just calculated the total size of the files to process in bytes and set it as a global var... I suppose I just need to know how many bytes I have read so far?  What would be the best way to go about that?

I maybe shouldn't be asking some of these questions yet, as I am not at that point in my code. lol... I am getting a bit excited though, as I have successfully implemented a lot of my parsing functions!!  (kind of proud of myself there  :D)
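Something like this is what I have in mind (a sketch, untested; the helper name is made up). Since the parser pulls everything through the underlying TFileStream, its Position property says how many bytes have been consumed so far, which can be compared against Size:

```pascal
uses
  Classes, SysUtils;

// Hypothetical helper: returns progress through one file as a percentage.
// In a real worker thread you would store the value in a field and use
// TThread.Synchronize (or Queue) to push it to the GUI.
function ProgressPercent(ASource: TFileStream): Integer;
begin
  if ASource.Size = 0 then
    Exit(100);
  Result := Round(ASource.Position * 100.0 / ASource.Size);
end;
```

One caveat: if the stream is wrapped in TReadBufStream, the file stream's Position reflects buffered read-ahead, so it can run slightly ahead of what has actually been parsed; for a progress bar that hardly matters.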

Quote from: alpine on July 10, 2021, 09:22:16 pm
*snip*

Tony Stone

  • Sr. Member
  • ****
  • Posts: 279
Re: Need advice from pros - processing large CSV files
« Reply #19 on: July 11, 2021, 03:41:09 am »
So I spent the past few hours making things work... somewhat.  At least reading and writing files.  BUT, when I load the large 1.4GB files, I watch my application's memory usage go up to over 6GB.  I am using basically your example... with read and write buffers.  Maybe I need to set a buffer size somehow?  I am currently running Linux Mint with Lazarus 2.0.12.
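In case the buffer size is the issue, one thing to try (untested; I'm assuming the capacity overload from FPC's bufstream unit, and the file name is made up) is passing an explicit capacity instead of the default:

```pascal
uses
  Classes, SysUtils, bufstream;

const
  BufSize = 1024 * 1024; // 1 MB; tune to taste

var
  FS: TFileStream;
  RBS: TReadBufStream;
  B: Byte;
begin
  FS := TFileStream.Create('big.csv', fmOpenRead or fmShareDenyWrite);
  try
    // Explicit buffer capacity instead of the small default.
    RBS := TReadBufStream.Create(FS, BufSize);
    try
      // ... hand RBS to the parser as before; a trivial read loop here:
      while RBS.Read(B, 1) = 1 do
        ; // consume
    finally
      RBS.Free;
    end;
  finally
    FS.Free;
  end;
end.
```

The buffers themselves are fixed-size, though, so a multi-gigabyte growth usually points at something accumulating elsewhere (e.g. strings kept in a list) rather than at the stream buffering.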


Quote from: alpine on July 10, 2021, 09:22:16 pm
*snip*
« Last Edit: July 11, 2021, 03:42:54 am by Tony Stone »

Tony Stone

  • Sr. Member
  • ****
  • Posts: 279
Re: Need advice from pros - processing large CSV files
« Reply #20 on: July 11, 2021, 06:19:17 am »
Thank you all!  I have an acceptable and working solution now!

I ended up going with the read buffer, write buffer, and CSVParser solution recommended by y.ivanov  8-)

Now it is time to clean this mess up!

engkin

  • Hero Member
  • *****
  • Posts: 3112
Re: Need advice from pros - processing large CSV files
« Reply #21 on: July 11, 2021, 06:29:48 am »
In theory your speed limit should be the speed of reading from the disk. For instance, if just reading the file takes 5 seconds but your code takes 50 seconds, you can probably do better.
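A quick way to establish that baseline (a sketch; the file name is made up): time a raw sequential read of the file and compare it against the time of the full parse.

```pascal
uses
  Classes, SysUtils;

var
  FS: TFileStream;
  Buf: array[0..65535] of Byte;  // 64 KB read chunks
  N: LongInt;
  T0: QWord;
  Total: Int64;
begin
  FS := TFileStream.Create('big.csv', fmOpenRead or fmShareDenyWrite);
  try
    T0 := GetTickCount64;
    Total := 0;
    // Plain sequential read; no parsing, so this measures pure I/O.
    repeat
      N := FS.Read(Buf, SizeOf(Buf));
      Inc(Total, N);
    until N = 0;
    WriteLn(Total, ' bytes in ', GetTickCount64 - T0, ' ms');
  finally
    FS.Free;
  end;
end.
```

If the parse takes an order of magnitude longer than this raw read, the bottleneck is in the processing, not the disk.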

MarkMLl

  • Hero Member
  • *****
  • Posts: 8525
Re: Need advice from pros - processing large CSV files
« Reply #22 on: July 11, 2021, 09:00:49 am »
Quote from: Tony Stone on July 11, 2021, 06:19:17 am
*snip*

I'd suggest that the key here (which you still haven't really answered) is the extent to which these files change, or whether to some extent the same files are being re-read.

If they're being reused, then it would be worth pre-parsing them into a stable form and possibly generating some sort of index on the fly. This would not necessarily be a GUI-based program.

Unless they contain a lot of whitespace, the size suggests that putting them straight into a database might not be viable, since (with a hefty dose of platform dependency) the database representation tends to be larger than the original files.

Just my 2d-worth.

MarkMLl
MT+86 & Turbo Pascal v1 on CCP/M-86, multitasking with LAN & graphics in 128Kb.
Logitech, TopSpeed & FTL Modula-2 on bare metal (Z80, '286 protected mode).
Pet hate: people who boast about the size and sophistication of their computer.
GitHub repositories: https://github.com/MarkMLl?tab=repositories

alpine

  • Hero Member
  • *****
  • Posts: 1412
Re: Need advice from pros - processing large CSV files
« Reply #23 on: July 11, 2021, 02:20:19 pm »
Quote from: Tony Stone
*snip*
I am using threads so that it will hopefully be faster, I want my user to continue using my GUI while the parsing is happening in the background.
*snip*

And if you still wish to involve multi-threading, be aware that MT adds an additional level of complexity.

In general you should be able to:
  • Create a thread
  • Transfer the arguments to it. Ordinary procedure parameters won't work, since each thread has its own separate stack
  • Start the thread
  • Transfer feedback to the main thread. Again, procedure parameters won't work; use object fields/properties for that, and you must synchronize with the main LCL thread, otherwise bad things can happen
  • Detect when the thread has finished its work
  • Additionally, you may wish to be able to terminate the thread prematurely

I have prepared a small sample of how to do it; this time it is a workable one. I hope it is clear enough. See the attachment.
I don't believe threads will make your processing faster; in fact, supplying feedback to the GUI and synchronizing will introduce additional delays. Also, you must provide some means of "locking" the related GUI elements until the working thread finishes, at least for UI consistency. The only benefit I see is a more responsive UI.
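The steps above can be sketched roughly like this (untested, {$mode objfpc}; all names are illustrative, not from the attachment):

```pascal
uses
  Classes, SysUtils;

type
  // Hypothetical worker thread; fields carry arguments in and feedback out.
  TCSVWorker = class(TThread)
  private
    FFilename: String;       // argument, transferred via the constructor
    FPercent: Integer;       // feedback for the main thread
    procedure ShowProgress;  // runs in the main/LCL thread
  protected
    procedure Execute; override;
  public
    constructor Create(const AFilename: String);
  end;

constructor TCSVWorker.Create(const AFilename: String);
begin
  FFilename := AFilename;    // set fields before the thread starts
  FreeOnTerminate := True;
  inherited Create(False);   // False = start immediately
end;

procedure TCSVWorker.ShowProgress;
begin
  // Safe to touch the GUI here, e.g. ProgressBar1.Position := FPercent;
  WriteLn('Progress: ', FPercent, '%');
end;

procedure TCSVWorker.Execute;
var
  I: Integer;
begin
  for I := 1 to 100 do
  begin
    if Terminated then Exit;       // allow premature termination
    // ... process a slice of FFilename here ...
    FPercent := I;
    Synchronize(@ShowProgress);    // marshal feedback to the main thread
  end;
end;
```

Detecting completion is usually done with the OnTerminate event or by having Execute synchronize one final "done" notification.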

BTW, from your posts I understand that your program is 64-bit. If that is the case, I can't see why you are concerned about the available RAM.

Regards,
« Last Edit: July 11, 2021, 02:22:30 pm by y.ivanov »
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

 
