
Author Topic: Suggestions please  (Read 2778 times)

liewald

  • Full Member
  • ***
  • Posts: 142
Suggestions please
« on: June 05, 2020, 09:40:14 am »
I have some pretty large CSV files (up to 0.5 TB, medical research!) that I need to extract some fields from. They are way too large to read into any in-memory database solution, so I really need some sort of paged, disk-based CSV solution. Anyone got any ideas?

Dave

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Suggestions please
« Reply #1 on: June 05, 2020, 09:57:16 am »
Use TCSVParser in Unit csvreadwrite; it reads only what is needed for the current cell.
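Roughly like this (an untested sketch; the column indices 0, 3, 5 and 6, the delimiter and the procedure name are just placeholders, and the output does no quoting of fields that themselves contain commas):

Code: Pascal
uses
  Classes, SysUtils, csvreadwrite;

procedure ExtractColumns(const InFile, OutFile: string);
var
  Parser: TCSVParser;
  Stream: TFileStream;
  OutText: TextFile;
  OutLine: string;
begin
  Stream := TFileStream.Create(InFile, fmOpenRead or fmShareDenyWrite);
  Parser := TCSVParser.Create;
  AssignFile(OutText, OutFile);
  Rewrite(OutText);
  try
    Parser.Delimiter := ',';
    Parser.SetSource(Stream);
    OutLine := '';
    while Parser.ParseNextCell do
    begin
      // a new row has started: flush the fields collected from the previous one
      if (Parser.CurrentCol = 0) and (Parser.CurrentRow > 0) then
      begin
        WriteLn(OutText, OutLine);
        OutLine := '';
      end;
      case Parser.CurrentCol of    // keep only the wanted columns
        0, 3, 5, 6:
          begin
            if OutLine <> '' then OutLine := OutLine + ',';
            OutLine := OutLine + Parser.CurrentCellText;
          end;
      end;
    end;
    if OutLine <> '' then
      WriteLn(OutText, OutLine);   // last row
  finally
    CloseFile(OutText);
    Parser.Free;
    Stream.Free;
  end;
end;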

liewald

  • Full Member
  • ***
  • Posts: 142
Re: Suggestions please
« Reply #2 on: June 05, 2020, 11:45:44 am »
I had a look at the code, not quite sure how that would work.

Basically what I want to do is:

1. read a record
2. extract fields A, D, F, G from fields A..Z
3. add these fields to another dataset

This seems to read fields sequentially, which will take forever with these files.

DonAlfredo

  • Hero Member
  • *****
  • Posts: 1739
Re: Suggestions please
« Reply #3 on: June 05, 2020, 11:58:38 am »
You could memory-map the whole file into RAM.

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Suggestions please
« Reply #4 on: June 05, 2020, 12:02:17 pm »
I don't think there is another way with CSV files. All data will be read anyway, since every block on disk probably contains some of these fields.

Moreover, even if that were not the case (e.g. if you have really long, multi-kB lines), the variable record size makes anything smarter hard.

With indexing you can optimize multiple passes (e.g. keep an index by some major parameter, like time), but it is laborious, and only useful if you run this multiple times with varying parameters.
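A crude sketch of such an index (here simply by line number: remember the byte offset of every Nth line so that a later pass can Seek close to the region of interest instead of rescanning from byte 0; indexing by an actual field value such as a timestamp works the same way, with the key parsed out of each line):

Code: Pascal
uses
  Classes, SysUtils;

procedure BuildLineOffsetIndex(const CsvFile, IndexFile: string; Step: Int64);
var
  Src: TFileStream;
  Idx: TextFile;
  Buf: array[0..65535] of Byte;
  BytesRead, i: Integer;
  LineNo, Offset: Int64;
begin
  Src := TFileStream.Create(CsvFile, fmOpenRead or fmShareDenyWrite);
  AssignFile(Idx, IndexFile);
  Rewrite(Idx);
  try
    LineNo := 0;
    Offset := 0;
    repeat
      BytesRead := Src.Read(Buf, SizeOf(Buf));
      for i := 0 to BytesRead - 1 do
        if Buf[i] = 10 then        // LF: the next line starts at Offset + i + 1
        begin
          Inc(LineNo);
          if LineNo mod Step = 0 then
            WriteLn(Idx, LineNo, ',', Offset + i + 1);   // "line,offset" pairs
        end;
      Inc(Offset, BytesRead);
    until BytesRead = 0;
  finally
    CloseFile(Idx);
    Src.Free;
  end;
end;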

howardpc

  • Hero Member
  • *****
  • Posts: 4144
Re: Suggestions please
« Reply #5 on: June 05, 2020, 01:09:52 pm »
...
This seems to read fields sequentially, which will take forever with these files
As Marco points out, there is no other way to read and interpret CSV files than sequential reading. The positions of the commas (or other separators) and of the line markers are unknown until they are discovered by reading, and usually the number of separators to expect on each line is unknown as well.

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Suggestions please
« Reply #6 on: June 05, 2020, 01:43:56 pm »
I had a look at the code, not quite sure how that would work.
Sorry, I did not find good example code for it. Basically, the outdated wiki article on CSVDocument (https://wiki.freepascal.org/CsvDocument) contains everything needed in the sample procedure "LoadGridFromCSVFile()"; you must skip the grid part here, of course, but accessing the file is the same.

440bx

  • Hero Member
  • *****
  • Posts: 3946
Re: Suggestions please
« Reply #7 on: June 05, 2020, 02:16:46 pm »
2. extract fields A, D, F, G from fields A..Z
3. add these fields to another dataset
As Don Alfredo suggested, mapping the file in memory is likely to be the best and simplest solution.

As far as how quickly a 500 GB (half-terabyte) file can be scanned, that depends mostly on the average number of characters per line. Presuming that every field has 30 characters and there are 26 of them (A..Z), that gives an average of 780 characters per line.

The beauty is that you are only interested in 4 fields, and field G is only the seventh field. After you've extracted that one (scanning about 210 characters, checking for commas to identify the fields up to and including field G), you can simply search for the EOL (no more testing for commas) and start over with the next line.

A decent machine, not a top-of-the-line one, should be able to scan about 300 MB per second like that (ballpark estimate); a 500 GB file would then take about 28 minutes. A really fast, top-of-the-line machine might cut that to less than half, but it will still take on the order of minutes to scan a file that large. Having the file on an SSD would very likely shorten the time noticeably too.
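Something along these lines, per line (an untested sketch; it reads the line as a string rather than scanning a memory-mapped buffer, and it assumes plain comma-separated data with no quoted fields containing embedded commas):

Code: Pascal
// Extract fields 0, 3, 5 and 6 (A, D, F, G) from one CSV line and
// ignore everything after field G.
function ExtractWanted(const Line: string): string;
var
  i, FieldStart, FieldNo: Integer;
begin
  Result := '';
  FieldNo := 0;
  FieldStart := 1;
  for i := 1 to Length(Line) + 1 do
    if (i > Length(Line)) or (Line[i] = ',') then
    begin
      if (FieldNo = 0) or (FieldNo = 3) or (FieldNo = 5) or (FieldNo = 6) then
      begin
        if Result <> '' then Result := Result + ',';
        Result := Result + Copy(Line, FieldStart, i - FieldStart);
      end;
      Inc(FieldNo);
      FieldStart := i + 1;
      if FieldNo > 6 then Exit;   // field G handled: skip the rest of the line
    end;
end;

The early Exit after field G is what avoids testing the remaining ~570 characters of an average line for commas.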

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

hrayon

  • Full Member
  • ***
  • Posts: 118
Re: Suggestions please
« Reply #8 on: June 05, 2020, 02:26:35 pm »
Hi!
If size is a problem, why not split it?
Linux has a "split" command.
You can split the file into multiple pieces based on the number of lines using the -l option.
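For reference, the same idea in Pascal, in case shelling out to split is not convenient (a rough sketch; the ".partNNNN" naming scheme is just an example):

Code: Pascal
uses
  SysUtils;

// Copy every LinesPerChunk lines of InFile into numbered chunk files.
procedure SplitByLines(const InFile: string; LinesPerChunk: Int64);
var
  Src, Dst: TextFile;
  Line: string;
  LineCount: Int64;
  ChunkNo: Integer;

  procedure OpenNextChunk;
  begin
    AssignFile(Dst, Format('%s.part%.4d', [InFile, ChunkNo]));
    Rewrite(Dst);
    Inc(ChunkNo);
  end;

begin
  AssignFile(Src, InFile);
  Reset(Src);
  LineCount := 0;
  ChunkNo := 0;
  OpenNextChunk;
  try
    while not Eof(Src) do
    begin
      ReadLn(Src, Line);
      if (LineCount > 0) and (LineCount mod LinesPerChunk = 0) then
      begin
        CloseFile(Dst);
        OpenNextChunk;
      end;
      WriteLn(Dst, Line);
      Inc(LineCount);
    end;
  finally
    CloseFile(Dst);
    CloseFile(Src);
  end;
end;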

wp

  • Hero Member
  • *****
  • Posts: 11858
Re: Suggestions please
« Reply #9 on: June 05, 2020, 03:00:07 pm »
I had a look at the code, not quite sure how that would work.
Sorry, I did not find good example code for it. Basically, the outdated wiki article on CSVDocument (https://wiki.freepascal.org/CsvDocument) contains everything needed in the sample procedure "LoadGridFromCSVFile()"; you must skip the grid part here, of course, but accessing the file is the same.
Find a working demo in the attachment. It reads today's file of confirmed CoVid-19 infections provided by the Johns Hopkins University (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) and calculates the sum of the cases reported for March 31 for all countries. Of course, this is a small file. It would be interesting to see the performance for a large file.
« Last Edit: June 05, 2020, 03:18:18 pm by wp »

marcov

  • Administrator
  • Hero Member
  • *
  • Posts: 11383
  • FPC developer.
Re: Suggestions please
« Reply #10 on: June 05, 2020, 03:21:33 pm »
Afaik, memory mapping is mostly useful for random access; for sequential reads it helps less than normal reads with a large block size.

The reason is that with random access you basically allow the OS to cache old blocks for reuse, while at the same time giving it a chance to discard them when necessary, which maximizes memory use without hindering other apps (too much).

440bx

  • Hero Member
  • *****
  • Posts: 3946
Re: Suggestions please
« Reply #11 on: June 05, 2020, 10:33:59 pm »
Afaik, memory mapping is mostly useful for random access; for sequential reads it helps less than normal reads with a large block size.

The reason is that with random access you basically allow the OS to cache old blocks for reuse, while at the same time giving it a chance to discard them when necessary, which maximizes memory use without hindering other apps (too much).
There is plenty of truth to that, but at least under Windows the programmer can specify the access pattern (sequential or random) when opening the file (using CreateFile), and from what I've seen so far, Windows seems to handle it quite well.

In his case, mapping the entire file (instead of mapping sections of it) keeps the code simple, and if "sequential scan" is specified in the call to CreateFile, performance should be good. In addition, it tells the OS that it can reuse the same physical pages it used for the last read, something very important when dealing with files that large.
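A minimal Windows-only sketch of that combination (error handling mostly omitted; mapping all of a 0.5 TB file needs a 64-bit process, and the function name is just an example):

Code: Pascal
uses
  Windows;

// Open with the sequential-scan hint and map the whole file read-only.
function MapWholeFile(const FileName: UnicodeString;
                      out Size: Int64; out View: Pointer): Boolean;
var
  hFile, hMap: THandle;
  SizeLow, SizeHigh: DWORD;
begin
  Result := False;
  View := nil;
  Size := 0;
  hFile := CreateFileW(PWideChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
  if hFile = INVALID_HANDLE_VALUE then Exit;
  SizeLow := GetFileSize(hFile, @SizeHigh);
  Size := (Int64(SizeHigh) shl 32) or SizeLow;
  hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
  CloseHandle(hFile);                     // the mapping keeps the file open
  if hMap = 0 then Exit;
  View := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);   // 0 = map everything
  CloseHandle(hMap);                      // the view keeps the mapping alive
  Result := View <> nil;
  // scan PByte(View)[0 .. Size - 1] for commas and line ends,
  // then call UnmapViewOfFile(View) when done
end;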

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.

zamronypj

  • Full Member
  • ***
  • Posts: 133
    • Fano Framework, Free Pascal web application framework
Re: Suggestions please
« Reply #12 on: June 06, 2020, 01:45:41 am »
I have some pretty large CSV files (up to 0.5 TB, medical research!) that I need to extract some fields from. They are way too large to read into any in-memory database solution, so I really need some sort of paged, disk-based CSV solution. Anyone got any ideas?

Dave

Is there any reason you do not want to use an RDBMS? MySQL and PostgreSQL provide ways for developers to import data from CSV.
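For example, with PostgreSQL something like this should work from a Free Pascal program via SQLdb (a sketch only; the connection settings, table name and file path are placeholders, and note that COPY ... FROM '<file>' reads a file that must be accessible to the database server itself, otherwise psql's \copy or COPY FROM STDIN is needed):

Code: Pascal
uses
  sqldb, pqconnection;

// Bulk-load a CSV into an existing PostgreSQL table with COPY.
procedure ImportCsv;
var
  Conn: TPQConnection;
  Tx: TSQLTransaction;
begin
  Conn := TPQConnection.Create(nil);
  try
    Tx := TSQLTransaction.Create(Conn);   // owned (and freed) by Conn
    Conn.Transaction := Tx;
    Conn.HostName := 'localhost';
    Conn.DatabaseName := 'research';
    Conn.UserName := 'dave';
    Conn.Password := 'secret';
    Conn.Open;
    Tx.StartTransaction;
    Conn.ExecuteDirect(
      'COPY rawdata FROM ''/data/big.csv'' WITH (FORMAT csv, HEADER true)');
    Tx.Commit;
  finally
    Conn.Free;
  end;
end;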
Fano Framework, Free Pascal web application framework https://fanoframework.github.io
Apache module executes Pascal program like scripting language https://zamronypj.github.io/mod_pascal/
Github https://github.com/zamronypj

egsuh

  • Hero Member
  • *****
  • Posts: 1273
Re: Suggestions please
« Reply #13 on: June 06, 2020, 11:25:03 am »
Not sure exactly what you want to do, but extracting just the fields you need is not a difficult job. A very traditional approach would work.

Code: Pascal
uses
  Classes;   // TStringList lives here

procedure ExtractFields(const inputfilename, outputfilename: string);
var
  fin, fout: TextFile;
  strlst, strout: TStringList;
  s: string;
begin
  AssignFile(fin, inputfilename);
  AssignFile(fout, outputfilename);
  Reset(fin);
  Rewrite(fout);

  strlst := TStringList.Create;
  strout := TStringList.Create;

  while not Eof(fin) do
  begin
    ReadLn(fin, s);
    strlst.CommaText := s;     // note: CommaText also treats spaces as separators

    strout.Clear;              // Sorry. I forgot this.

    strout.Append(strlst[0]);  // field A
    strout.Append(strlst[3]);  // field D
    strout.Append(strlst[5]);  // field F
    strout.Append(strlst[6]);  // field G

    WriteLn(fout, strout.CommaText);
  end;
  strlst.Free;
  strout.Free;

  CloseFile(fin);
  CloseFile(fout);
end;

I put the fields into another text file (a CSV file which will hopefully be a manageable size), but you can write them to any format you want.
« Last Edit: June 08, 2020, 03:47:34 am by egsuh »

440bx

  • Hero Member
  • *****
  • Posts: 3946
Re: Suggestions please
« Reply #14 on: June 06, 2020, 11:41:35 am »
I put the fields into another text file (a CSV file which will hopefully be a manageable size), but you can write them to any format you want.
From the OP, I get the feeling the problem is that some of his CSV files are rather large (half a terabyte is large). I'd be concerned that the "strout" stringlist will either not be able to handle the volume or become prohibitively slow.

OTOH, it's probably worth a try. If the stringlist can handle it gracefully, then you've offered a simple solution.

(FPC v3.0.4 and Lazarus 1.8.2) or (FPC v3.2.2 and Lazarus v3.2) on Windows 7 SP1 64bit.
