### Bookstore

 Computer Math and Games in Pascal (preview) Lazarus Handbook

### Author Topic: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?  (Read 5911 times)

#### domasz

• Sr. Member
• Posts: 434
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #60 on: March 03, 2024, 07:15:00 pm »
About twice faster than paweld's version (71 sec for 1 billion):

Code: Pascal  [Select][+][-]
1. program project1;
2.
3. {\$mode objfpc}{\$H+}
4.
5. uses
6.   {\$IFDEF UNIX}
8.   {\$ENDIF}
9.   Classes, SysUtils
10.   { you can add units after this };
11.
12. const Seed: Longint = 46668267;
13.       HowMany = 100000000;    // 100 mil <------------ change here
14. var F: TFileStream;
15.     start, stop, stop2: QWord;
16.     Names: TStringList;
17.     Names2: array of String;
18.     Body, Buf, Line: RawByteString;
19.     A,B: Integer;
20.     i: Integer;
21.     Temp, Index: Integer;
22.     Temp2: Integer;
23.     Count: Integer;
24.     k,pp: Integer;
25.     Prog: Integer;
26.     Step: Integer;
27.
28.     IntToStr2: array[-999..999] of String;
29. begin
31.   start := GetTickCount64;
32.
33.   Names := TStringList.Create;
34.   Names.Duplicates := dupIgnore;
35.   Names.Sorted := True;
36.
37.   SetLength(Body, F.Size);
39.
40.   F.Free;
41.
42.   A := 0;
43.   while 1=1 do begin
44.     B := Pos(';', Body, A);
45.     if B < A then break;
46.
47.     Line := Copy(Body, A, B-A);
48.     A := Pos(#10, Body, B)+1;
49.
51.   end;
52.
53.   SetLength(Names2, Names.Count);
54.   for i:=0 to Names.Count-1 do begin
55.     Names2[i] := Names[i];
56.   end;
57.
58.
59.   stop := GetTickCount64;
60.
61.   WriteLn(Format('Done: Processed %d weather stations in %d ms', [
62.             Names.Count,
63.             stop-start
64.           ]));
65.   WriteLn;
66.
67.   Count := Names.Count;
68.
69.   RandSeed := Seed;
70.   Prog := 0;
71.   Buf := '';
72.   k := 0;
73.
74.   F := TFileStream.Create('result.txt', fmCreate);
75.
76.   Step := HowMany div 10;
77.
78.   for i:=-999 to 999 do IntToStr2[i] := ';' + Format('%.1f', [i/10]) + #10;
79.
80.   pp := 0;
81.   for i:=1 to HowMany do begin
82.     Index := Random(Count);
83.     //Temp := Random(200)-100;
84.     //Temp2 := Random(9);
85.     Temp := Random(999*2)-999;
86.
87.     //Buf := Buf + Names2[Index] + ';' + IntToStr(Temp) + '.' + IntToStr(Temp2) + #10;
88.     Buf := Buf + Names2[Index] + IntToStr2[Temp];
89.     Inc(k);
90.
91.     if k > 5000 then begin
92.       F.Write(Buf[1], Length(Buf));
93.       Buf := '';
94.       k := 0;
95.     end;
96.
97.     Inc(pp);
98.     if pp > Step then begin
99.       pp := 0;
100.       Inc(Prog, 5);
101.       Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
102.     end;
103.   end;
104.
105.   Prog := 50;
106.   Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
107.   Writeln;
108.
109.   F.Write(Buf[1], Length(Buf));
110.   F.Free;
111.
112.   Names.Free;
113.
114.   stop2 := GetTickCount64;
115.
116.
117.   WriteLn(Format('Total time: %f s', [    (stop2-start)/1000   ]));
118.   WriteLn;
119. end.
120.

#### colo

• New Member
• Posts: 45
##### Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #61 on: March 03, 2024, 07:39:35 pm »
Quote
real    2m59.505s
user    2m39.890s
sys     0m14.236s

Can anyone explain what real, sys, and user mean?

This seems to be one of those things which are not as obvious as they seem to be.

"real" measures elapsed (wallclock) time between invocation and termination. If your program did nothing much but sleep/yield to the OS scheduler for 13 seconds (like `sleep 13` on the shell would), that counter would still indicate ~13 seconds.
"user" measures total on-CPU time spent in user mode, i.e., using the CPU to do pure application-level stuff (like sorting an array in memory, or buffering and formatting output, etc.).
"sys" measures total on-CPU time spent in kernel context (mostly how much CPU time was spent by system calls executed on behalf of the running program, like the time calls to write(2) took to write a few GB of data to some storage in your system).
« Last Edit: March 03, 2024, 07:41:27 pm by colo »

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 1119
• Professional amateur ;-P
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #62 on: March 03, 2024, 08:38:10 pm »
Hey Y'All

New update via paweld based on code from PascalVault:
Code: Bash  [Select][+][-]
1. \$ time ./bin/generator -i ./data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
2. Input File : /home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv
3. Output File: /tmp/measurements-1_000_000_000.txt
4. Line Count : 1,000,000,000
5.
6. Building Weather Stations...
7. Done: Processed 44,691 entries from a total of 41,343 weather stations in 181 ms
8.
9. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,892,903,901, elapsed: 1 min, 33 sec
10.
11. Done: file size: 16,892,903,901, elapsed: 1 min, 33 sec
12.
13. real    1m41.492s
14. user    1m20.240s
15. sys     0m15.451s

This is a 50% reduction on the previous version!!!

Man, I so effin adore Object Pascal!!!

Cheers,
Gus
« Last Edit: March 03, 2024, 09:39:33 pm by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

#### Warfley

• Hero Member
• Posts: 1499
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #63 on: March 03, 2024, 10:09:30 pm »
Quote
real    2m59.505s
user    2m39.890s
sys     0m14.236s

Can anyone explain what real, sys, and user mean?

This seems to be one of those things which are not as obvious as they seem to be.

Real = total time on the clock
User = the time that your "user space" code took (i.e. your program)
Sys = time spent on the Kernel/system space, i.e. whenever you do syscalls or do things like writing to the disk, etc

The difference between real and user+sys is the time lost due to process scheduling and context switches. If you have to many short syscalls it can actually be a problem that just waiting for the scheduler takes more time than the actual work

#### bytebites

• Hero Member
• Posts: 639
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #64 on: March 04, 2024, 10:23:46 am »
deleted unnecessary code.
« Last Edit: March 04, 2024, 07:48:04 pm by bytebites »

#### paweld

• Hero Member
• Posts: 991
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #65 on: March 04, 2024, 10:53:41 am »
I made this change and several others this morning. Current times are 38 seconds in fpc 3.2-fixes and 20 seconds in fpc trunk (different function Random) - I checked on Windows 10
Best regards / Pozdrawiam
paweld

#### TRon

• Hero Member
• Posts: 2503
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #66 on: March 05, 2024, 10:44:18 pm »
regarding the challenge, I came across this post

#### domasz

• Sr. Member
• Posts: 434
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #67 on: March 05, 2024, 11:10:34 pm »
The code is already well optimized. The only thing I think could help is parallelization.

#### TRon

• Hero Member
• Posts: 2503
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #68 on: March 06, 2024, 12:02:11 am »
Most probably domasz.

I am aware that TS focused on the data generator but the linked post shows the process of how to approach these kind of things rather nice, from easy peasy solution to more thoughtful (and on top of that regarding the same topic).

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 1119
• Professional amateur ;-P
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #69 on: March 06, 2024, 01:14:24 am »
Hey TRon,

regarding the challenge, I came across this post

I've also listed this blog post, also regarding a solution in Go: https://www.bytesizego.com/blog/one-billion-row-challenge-go

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 1119
• Professional amateur ;-P
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #70 on: March 06, 2024, 06:32:21 pm »
Hey Y'All,

Just to convey the latest stats!!

Latest with FPC main aka trunk
Code: Bash  [Select][+][-]
1. \$ trunk_lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
2. \$ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
4. Output Filename: "/dev/null"
5. Line Count: 1,000,000,000
6.
7. Building Weather Stations...
8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 116 ms
9.
10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 19 sec
11.
12. Done: file size: 0, elapsed: 0 min, 19 sec
13.
14. real    0m19.504s
15. user    0m19.450s
16. sys 0m0.053s
17. \$ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
20. Line Count: 1,000,000,000
21.
22. Building Weather Stations...
23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 133 ms
24.
25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 1 min, 1 sec
26.
27. Done: file size: 16,847,913,227, elapsed: 1 min, 1 sec
28.
29. real    1m1.459s
30. user    0m22.928s
31. sys 0m14.753s
32. \$ sha256sum /tmp/measurements-1_000_000_000.txt

Latest with FPC 3.2.2
Code: Bash  [Select][+][-]
1. \$ lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
2. \$ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
4. Output Filename: "/dev/null"
5. Line Count: 1,000,000,000
6.
7. Building Weather Stations...
8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 182 ms
9.
10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 29 sec
11.
12. Done: file size: 0, elapsed: 0 min, 29 sec
13.
14. real    0m30.052s
15. user    0m29.996s
16. sys 0m0.052s
17. \$ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
20. Line Count: 1,000,000,000
21.
22. Building Weather Stations...
23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 184 ms
24.
25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 0 min, 57 sec
26.
27. Done: file size: 16,847,913,227, elapsed: 0 min, 57 sec
28.
29. real    0m57.630s
30. user    0m31.647s
31. sys 0m12.971s
32. \$ sha256sum /tmp/measurements-1_000_000_000.txt

Cheers,
Gus
« Last Edit: March 06, 2024, 06:34:44 pm by Gustavo 'Gus' Carreno »
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

#### vfclists

• Hero Member
• Posts: 1013
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #71 on: March 06, 2024, 09:44:44 pm »
Hey Y'All,

Just to convey the latest stats!!

Latest with FPC main aka trunk
Code: Bash  [Select][+][-]
1. \$ trunk_lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
2. \$ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
4. Output Filename: "/dev/null"
5. Line Count: 1,000,000,000
6.
7. Building Weather Stations...
8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 116 ms
9.
10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 19 sec
11.
12. Done: file size: 0, elapsed: 0 min, 19 sec
13.
14. real    0m19.504s
15. user    0m19.450s
16. sys 0m0.053s
17. \$ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
20. Line Count: 1,000,000,000
21.
22. Building Weather Stations...
23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 133 ms
24.
25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 1 min, 1 sec
26.
27. Done: file size: 16,847,913,227, elapsed: 1 min, 1 sec
28.
29. real    1m1.459s
30. user    0m22.928s
31. sys 0m14.753s
32. \$ sha256sum /tmp/measurements-1_000_000_000.txt

Latest with FPC 3.2.2
Code: Bash  [Select][+][-]
1. \$ lazbuild -B --bm=Release generator/Lazarus/src/generator.lpi
2. \$ time ./bin/generator -i data/weather_stations.csv -o /dev/null -n 1_000_000_000
3. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
4. Output Filename: "/dev/null"
5. Line Count: 1,000,000,000
6.
7. Building Weather Stations...
8. Done: Processed 44,691 entries from a total of 41,343 weather stations in 182 ms
9.
10. [##################################################] 100.00 % lines: 1,000,000,000, file size: 0, elapsed: 0 min, 29 sec
11.
12. Done: file size: 0, elapsed: 0 min, 29 sec
13.
14. real    0m30.052s
15. user    0m29.996s
16. sys 0m0.052s
17. \$ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
18. Input Filename: "/home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv"
19. Output Filename: "/tmp/measurements-1_000_000_000.txt"
20. Line Count: 1,000,000,000
21.
22. Building Weather Stations...
23. Done: Processed 44,691 entries from a total of 41,343 weather stations in 184 ms
24.
25. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,847,913,227, elapsed: 0 min, 57 sec
26.
27. Done: file size: 16,847,913,227, elapsed: 0 min, 57 sec
28.
29. real    0m57.630s
30. user    0m31.647s
31. sys 0m12.971s
32. \$ sha256sum /tmp/measurements-1_000_000_000.txt

Cheers,
Gus

You should run them in a profiler to find out which parts of FPC have improved.
Lazarus 3.0/FPC 3.2.2

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 1119
• Professional amateur ;-P
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #72 on: March 08, 2024, 10:38:58 pm »
Hey vfclists,

You should run them in a profiler to find out which parts of FPC have improved.

At this point I'm considering the thing finished, so I'm now focused on trying to get some submissions coming in.

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

#### Gustavo 'Gus' Carreno

• Hero Member
• Posts: 1119
• Professional amateur ;-P
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #73 on: March 08, 2024, 10:44:10 pm »
Hey Y'All,

And, would you look at that!!!

Our beloved GetMem sent me the first submission via email!!

Code: Bash  [Select][+][-]
1. \$ time ./bin/sbalazs /tmp/measurements-1_000_000_000.txt out.txt 30
2. Process started at: 21:19:31:915
3. Process finished at: 21:20:30:621
4. Duration: 58706 ms
5.
6. real    0m58.710s
7. user    16m33.586s
8. sys     2m6.977s

This is just me jumping the gun with his code. Ran it once with time and not hyperfine.
Still need to make it compliant to use STDOUT and not a file.

Not sure what I'm gonna do with that thread count param...

I'm also thinking of writing a blog post announcing the thing, and creating another thread for the challenge itself!!

And I've been lazy by not providing an official baseline implementation. Something that is the most naive, one threaded and sequential as possible

Welp, stuff for another day !!!

Cheers,
Gus
Lazarus 3.99(main) FPC 3.3.1(main) Ubuntu 23.10 64b Dark Theme
Lazarus 3.0.0(stable) FPC 3.2.2(stable) Ubuntu 23.10 64b Dark Theme
http://github.com/gcarreno

#### BeniBela

• Hero Member
• Posts: 906
##### Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #74 on: March 09, 2024, 11:02:08 am »
>and emits the results on STDOUT like this (i.e. sorted alphabetically by station name

what does alphabetically mean? case-sensitive? case-insensitive? Unicode-aware?