Recent

Author Topic: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?  (Read 5928 times)

domasz

  • Sr. Member
  • ****
  • Posts: 435
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #60 on: March 03, 2024, 07:15:00 pm »
About twice faster than paweld's version (71 sec for 1 billion):


Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   {$IFDEF UNIX}
  7.   cthreads,
  8.   {$ENDIF}
  9.   Classes, SysUtils
  10.   { you can add units after this };
  11.  
  12. const Seed: Longint = 46668267;
  13.       HowMany = 100000000;    // 100 mil <------------ change here
  14. var F: TFileStream;
  15.     start, stop, stop2: QWord;
  16.     Names: TStringList;
  17.     Names2: array of String;
  18.     Body, Buf, Line: RawByteString;
  19.     A,B: Integer;
  20.     i: Integer;
  21.     Temp, Index: Integer;
  22.     Temp2: Integer;
  23.     Count: Integer;
  24.     k,pp: Integer;
  25.     Prog: Integer;
  26.     Step: Integer;
  27.  
  28.     IntToStr2: array[-999..999] of String;
  29. begin
  30.   F := TFileStream.Create('weather_stations.csv', fmOpenRead);
  31.   start := GetTickCount64;
  32.  
  33.   Names := TStringList.Create;
  34.   Names.Duplicates := dupIgnore;
  35.   Names.Sorted := True;
  36.  
  37.   SetLength(Body, F.Size);
  38.   F.Read(Body[1], F.Size);
  39.  
  40.   F.Free;
  41.  
  42.   A := 0;
  43.   while 1=1 do begin
  44.     B := Pos(';', Body, A);
  45.     if B < A then break;
  46.  
  47.     Line := Copy(Body, A, B-A);
  48.     A := Pos(#10, Body, B)+1;
  49.  
  50.     Names.Add(Line);
  51.   end;
  52.  
  53.   SetLength(Names2, Names.Count);
  54.   for i:=0 to Names.Count-1 do begin
  55.     Names2[i] := Names[i];
  56.   end;
  57.  
  58.  
  59.   stop := GetTickCount64;
  60.  
  61.   WriteLn(Format('Done: Processed %d weather stations in %d ms', [
  62.             Names.Count,
  63.             stop-start
  64.           ]));
  65.   WriteLn;
  66.  
  67.   Count := Names.Count;
  68.  
  69.   RandSeed := Seed;
  70.   Prog := 0;
  71.   Buf := '';
  72.   k := 0;
  73.  
  74.   F := TFileStream.Create('result.txt', fmCreate);
  75.  
  76.   Step := HowMany div 10;
  77.  
  78.   for i:=-999 to 999 do IntToStr2[i] := ';' + Format('%.1f', [i/10]) + #10;
  79.  
  80.   pp := 0;
  81.   for i:=1 to HowMany do begin
  82.     Index := Random(Count);
  83.     //Temp := Random(200)-100;
  84.     //Temp2 := Random(9);
  85.     Temp := Random(999*2)-999;
  86.  
  87.     //Buf := Buf + Names2[Index] + ';' + IntToStr(Temp) + '.' + IntToStr(Temp2) + #10;
  88.     Buf := Buf + Names2[Index] + IntToStr2[Temp];
  89.     Inc(k);
  90.  
  91.     if k > 5000 then begin
  92.       F.Write(Buf[1], Length(Buf));
  93.       Buf := '';
  94.       k := 0;
  95.     end;
  96.  
  97.     Inc(pp);
  98.     if pp > Step then begin
  99.       pp := 0;
  100.       Inc(Prog, 5);
  101.       Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
  102.     end;
  103.   end;
  104.  
  105.   Prog := 50;
  106.   Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
  107.   Writeln;
  108.  
  109.   F.Write(Buf[1], Length(Buf));
  110.   F.Free;
  111.  
  112.   Names.Free;
  113.  
  114.   stop2 := GetTickCount64;
  115.  
  116.  
  117.   WriteLn(Format('Total time: %f s', [    (stop2-start)/1000   ]));
  118.   WriteLn;
  119. end.
  120.  

colo

  • New Member
  • *
  • Posts: 45
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #61 on: March 03, 2024, 07:39:35 pm »
Quote
real    2m59.505s
user    2m39.890s
sys     0m14.236s


Can anyone explain what real, sys, and user mean?

This seems to be one of those things which are not as obvious as they seem to be.

"real" measures elapsed (wallclock) time between invocation and termination. If your program did nothing much but sleep/yield to the OS scheduler for 13 seconds (like `sleep 13` on the shell would), that counter would still indicate ~13 seconds.
"user" measures total on-CPU time spent in user mode, i.e., using the CPU to do pure application-level stuff (like sorting an array in memory, or buffering and formatting output, etc.).
"sys" measures total on-CPU time spent in kernel context (mostly how much CPU time was spent by system calls executed on behalf of the running program, like the time calls to write(2) took to write a few GB of data to some storage in your system).
« Last Edit: March 03, 2024, 07:41:27 pm by colo »

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #62 on: March 03, 2024, 08:38:10 pm »
Hey Y'All

New update via paweld based on code from PascalVault:
Code: Bash  [Select][+][-]
  1. $ time ./bin/generator -i ./data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  2. Input File : /home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv
  3. Output File: /tmp/measurements-1_000_000_000.txt
  4. Line Count : 1,000,000,000
  5.  
  6. Building Weather Stations...
  7. Done: Processed 44,691 entries from a total of 41,343 weather stations in 181 ms
  8.  
  9. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,892,903,901, elapsed: 1 min, 33 sec    
  10.  
  11. Done: file size: 16,892,903,901, elapsed: 1 min, 33 sec
  12.  
  13. real    1m41.492s
  14. user    1m20.240s
  15. sys     0m15.451s

This is a 50% reduction on the previous version!!!

Man, I so effin adore Object Pascal!!!

Cheers,
Gus
« Last Edit: March 03, 2024, 09:39:33 pm by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

Warfley

  • Hero Member
  • *****
  • Posts: 1499
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #63 on: March 03, 2024, 10:09:30 pm »
Quote
real    2m59.505s
user    2m39.890s
sys     0m14.236s


Can anyone explain what real, sys, and user mean?

This seems to be one of those things which are not as obvious as they seem to be.


Real = total time on the clock
User = the time that your "user space" code took (i.e. your program)
Sys = time spent on the Kernel/system space, i.e. whenever you do syscalls or do things like writing to the disk, etc

The difference between real and user+sys is the time lost due to process scheduling and context switches. If you have to many short syscalls it can actually be a problem that just waiting for the scheduler takes more time than the actual work

bytebites

  • Hero Member
  • *****
  • Posts: 639
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #64 on: March 04, 2024, 10:23:46 am »
deleted unnecessary code.
« Last Edit: March 04, 2024, 07:48:04 pm by bytebites »

paweld

  • Hero Member
  • *****
  • Posts: 995
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #65 on: March 04, 2024, 10:53:41 am »
I made this change and several others this morning. Current times are 38 seconds in fpc 3.2-fixes and 20 seconds in fpc trunk (different function Random) - I checked on Windows 10
Best regards / Pozdrawiam
paweld

TRon

  • Hero Member
  • *****
  • Posts: 2506
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #66 on: March 05, 2024, 10:44:18 pm »
regarding the challenge, I came across this post

domasz

  • Sr. Member
  • ****
  • Posts: 435
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #67 on: March 05, 2024, 11:10:34 pm »
The code is already well optimized. The only thing I think could help is parallelization.

TRon

  • Hero Member
  • *****
  • Posts: 2506
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #68 on: March 06, 2024, 12:02:11 am »
Most probably domasz.

I am aware that TS focused on the data generator but the linked post shows the process of how to approach these kind of things rather nice, from easy peasy solution to more thoughtful (and on top of that regarding the same topic).

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #69 on: March 06, 2024, 01:14:24 am »
Hey TRon,

regarding the challenge, I came across this post

I've also listed this blog post, also regarding a solution in Go: https://www.bytesizego.com/blog/one-billion-row-challenge-go

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #70 on: March 06, 2024, 06:32:21 pm »
Hey Y'All,

Just to convey the latest stats!!

Latest with FPC main aka trunk
Code: Bash  [Select][+][-]
  1. $ trunk_lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
  2. $ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
  3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  4. Output Filename: "/dev/null"
  5. Line Count: 1,000,000,000
  6.  
  7. Building Weather Stations...
  8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 116 ms
  9.  
  10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 19 sec    
  11.  
  12. Done: file size: 0, elapsed: 0 min, 19 sec
  13.  
  14. real    0m19.504s
  15. user    0m19.450s
  16. sys 0m0.053s
  17. $ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
  20. Line Count: 1,000,000,000
  21.  
  22. Building Weather Stations...
  23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 133 ms
  24.  
  25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 1 min, 1 sec    
  26.  
  27. Done: file size: 16,847,913,227, elapsed: 1 min, 1 sec
  28.  
  29. real    1m1.459s
  30. user    0m22.928s
  31. sys 0m14.753s
  32. $ sha256sum /tmp/measurements-1_000_000_000.txt
  33. ebad17b266ee9f5cb3d118531f197e6f68c9ab988abc5cb9506e6257e1a52ce6  /tmp/measurements-1_000_000_000.txt


Latest with FPC 3.2.2
Code: Bash  [Select][+][-]
  1. $ lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
  2. $ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
  3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  4. Output Filename: "/dev/null"
  5. Line Count: 1,000,000,000
  6.  
  7. Building Weather Stations...
  8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 182 ms
  9.  
  10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 29 sec    
  11.  
  12. Done: file size: 0, elapsed: 0 min, 29 sec
  13.  
  14. real    0m30.052s
  15. user    0m29.996s
  16. sys 0m0.052s
  17. $ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
  20. Line Count: 1,000,000,000
  21.  
  22. Building Weather Stations...
  23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 184 ms
  24.  
  25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 0 min, 57 sec    
  26.  
  27. Done: file size: 16,847,913,227, elapsed: 0 min, 57 sec
  28.  
  29. real    0m57.630s
  30. user    0m31.647s
  31. sys 0m12.971s
  32. $ sha256sum /tmp/measurements-1_000_000_000.txt
  33. ebad17b266ee9f5cb3d118531f197e6f68c9ab988abc5cb9506e6257e1a52ce6  /tmp/measurements-1_000_000_000.txt

Cheers,
Gus
« Last Edit: March 06, 2024, 06:34:44 pm by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

vfclists

  • Hero Member
  • *****
  • Posts: 1013
    • HowTos Considered Harmful?
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #71 on: March 06, 2024, 09:44:44 pm »
Hey Y'All,

Just to convey the latest stats!!

Latest with FPC main aka trunk
Code: Bash  [Select][+][-]
  1. $ trunk_lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
  2. $ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
  3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  4. Output Filename: "/dev/null"
  5. Line Count: 1,000,000,000
  6.  
  7. Building Weather Stations...
  8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 116 ms
  9.  
  10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 19 sec    
  11.  
  12. Done: file size: 0, elapsed: 0 min, 19 sec
  13.  
  14. real    0m19.504s
  15. user    0m19.450s
  16. sys 0m0.053s
  17. $ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
  20. Line Count: 1,000,000,000
  21.  
  22. Building Weather Stations...
  23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 133 ms
  24.  
  25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 1 min, 1 sec    
  26.  
  27. Done: file size: 16,847,913,227, elapsed: 1 min, 1 sec
  28.  
  29. real    1m1.459s
  30. user    0m22.928s
  31. sys 0m14.753s
  32. $ sha256sum /tmp/measurements-1_000_000_000.txt
  33. ebad17b266ee9f5cb3d118531f197e6f68c9ab988abc5cb9506e6257e1a52ce6  /tmp/measurements-1_000_000_000.txt


Latest with FPC 3.2.2
Code: Bash  [Select][+][-]
  1. $ lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
  2. $ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
  3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  4. Output Filename: "/dev/null"
  5. Line Count: 1,000,000,000
  6.  
  7. Building Weather Stations...
  8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 182 ms
  9.  
  10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 29 sec    
  11.  
  12. Done: file size: 0, elapsed: 0 min, 29 sec
  13.  
  14. real    0m30.052s
  15. user    0m29.996s
  16. sys 0m0.052s
  17. $ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
  19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
  20. Line Count: 1,000,000,000
  21.  
  22. Building Weather Stations...
  23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 184 ms
  24.  
  25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 0 min, 57 sec    
  26.  
  27. Done: file size: 16,847,913,227, elapsed: 0 min, 57 sec
  28.  
  29. real    0m57.630s
  30. user    0m31.647s
  31. sys 0m12.971s
  32. $ sha256sum /tmp/measurements-1_000_000_000.txt
  33. ebad17b266ee9f5cb3d118531f197e6f68c9ab988abc5cb9506e6257e1a52ce6  /tmp/measurements-1_000_000_000.txt

Cheers,
Gus

You should run them in a profiler to find out which parts of FPC have improved.
Lazarus 3.0/FPC 3.2.2

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #72 on: March 08, 2024, 10:38:58 pm »
Hey vfclists,

You should run them in a profiler to find out which parts of FPC have improved.

At this point I'm considering the thing finished, so I'm now focused on trying to get some submissions coming in.

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1119
  • Professional amateur ;-P
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #73 on: March 08, 2024, 10:44:10 pm »
Hey Y'All,

And, would you look at that!!!

Our beloved GetMem sent me the first submission via email!!

Code: Bash  [Select][+][-]
  1. $ time ./bin/sbalazs /tmp/measurements-1_000_000_000.txt out.txt 30
  2. Process started at: 21:19:31:915
  3. Process finished at: 21:20:30:621
  4. Duration: 58706 ms
  5.  
  6. real    0m58.710s
  7. user    16m33.586s
  8. sys     2m6.977s

This is just me jumping the gun with his code. Ran it once with time and not hyperfine.
Still need to make it compliant to use STDOUT and not a file.

Not sure what I'm gonna do with that thread count param...

I'm also thinking of writing a blog post announcing the thing, and creating another thread for the challenge itself!!

And I've been lazy by not providing an official baseline implementation. Something that is the most naive, one threaded and sequential as possible :D

Welp, stuff for another day !!!

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

BeniBela

  • Hero Member
  • *****
  • Posts: 906
    • homepage
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #74 on: March 09, 2024, 11:02:08 am »
>and emits the results on STDOUT like this (i.e. sorted alphabetically by station name

what does alphabetically mean? case-sensitive? case-insensitive? Unicode-aware?

 

TinyPortal © 2005-2018