Recent

Author Topic: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?  (Read 9899 times)

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #45 on: March 03, 2024, 04:09:05 pm »
Hey Warfley,

Are these good in order to use perf?

Cheers,
Gus

alpine

  • Hero Member
  • *****
  • Posts: 1412
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #46 on: March 03, 2024, 04:10:03 pm »

The access of the weather station list is then accessed via a randomly generated index, not by it's name.
So it makes no sense that I use any sort of hash based container, since I'm not using a string to get the name, amirite?! :D

I was under impression that you expect a duplicates into the input file and because of that you're using a sorted string list.
If you don't have duplicates, why you'd set explicitly the
Code: Pascal  [Select][+][-]
  1.   FStationNames.Duplicates:= dupIgnore;
  2.   FStationNames.Sorted:= True;
  in the first place?
"I'm sorry Dave, I'm afraid I can't do that."
—HAL 9000

paweld

  • Hero Member
  • *****
  • Posts: 1579
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #47 on: March 03, 2024, 04:13:23 pm »
I made some changes: https://github.com/paweld/1brc-ObjectPascal
and with me the results are as follows:
Code: [Select]
- on Debian 12 x64 (VM):
  * python: 6 minutes 22 seconds
  * pascal: 3 minutes 7 seconds

- on Windows 10 x64
  * python: 9 minutes 29 seconds
  * pascal: 2 minutes 51 seconds
Best regards / Pozdrawiam
paweld

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #48 on: March 03, 2024, 04:23:30 pm »
Hey Alpine,


The access of the weather station list is then accessed via a randomly generated index, not by it's name.
So it makes no sense that I use any sort of hash based container, since I'm not using a string to get the name, amirite?! :D

I was under impression that you expect a duplicates into the input file and because of that you're using a sorted string list.
If you don't have duplicates, why you'd set explicitly the
Code: Pascal  [Select][+][-]
  1.   FStationNames.Duplicates:= dupIgnore;
  2.   FStationNames.Sorted:= True;
  in the first place?

I did not even attempted to look at the input file to see if there are duplicates.
I'm assuming that there are duplicates, and I'm sorting them so we have the same list for every run.

One thing is the building of the list, and then another thing is the access of it.
  • I build the list once, sorted and de-duped.
  • I access that built list with a randomly generated index.

Hope this makes more sense and we can move past the list generation :D

Cheers,
Gus

Thaddy

  • Hero Member
  • *****
  • Posts: 18792
  • Glad to be alive.
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #49 on: March 03, 2024, 04:24:32 pm »
IF I succeed it will be here first. 8-)
Recovered from removal of tumor in tongue following tongue reconstruction with a part from my leg.

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #50 on: March 03, 2024, 04:42:40 pm »
Hey Thaddy

IF I succeed it will be here first. 8-)

No probs!!
Here, there, anywhere, I'll still have the opportunity to look at your code  :P

Cheers,
Gus

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #51 on: March 03, 2024, 04:44:05 pm »
Hey Y'All,

I've released v0.1 with the `valgrind` file.

Happy profiling !!

Cheers,
Gus

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #52 on: March 03, 2024, 05:09:06 pm »
Hey Y'All,

STOP THE PRESSES!!!!

@paweld has vindicated our honour:
Code: Bash  [Select][+][-]
  1. $ time ./bin/generator -i data/weather_stations.csv -o /tmp/measurements-1_000_000_000.txt -n 1_000_000_000
  2. Input File : /home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv
  3. Output File: /tmp/measurements-1_000_000_000.txt
  4. Line Count : 1,000,000,000
  5.  
  6. Building Weather Stations...
  7. Done: Processed 44,691 entries from a total of 41,343 weather stations in 180 ms
  8.  
  9. [##################################################] 100.00 %
  10.  
  11. real    3m4.742s
  12. user    2m50.476s
  13. sys    0m14.223s

This is HALF the time that Python took on my computer, IIRC!!

Hurray for @paweld !!!!!!!!!1

Cheers,
Gus

vfclists

  • Hero Member
  • *****
  • Posts: 1165
    • HowTos Considered Harmful?
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #53 on: March 03, 2024, 05:28:49 pm »
I made some changes: https://github.com/paweld/1brc-ObjectPascal
and with me the results are as follows:
Code: [Select]
- on Debian 12 x64 (VM):
  * python: 6 minutes 22 seconds
  * pascal: 3 minutes 7 seconds

- on Windows 10 x64
  * python: 9 minutes 29 seconds
  * pascal: 2 minutes 51 seconds

Is the VM on the Windows machine?
Lazarus 3.0/FPC 3.2.2

paweld

  • Hero Member
  • *****
  • Posts: 1579
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #54 on: March 03, 2024, 05:45:11 pm »
Quote from: vfclists
Is the VM on the Windows machine?
Yes, the host is Windows 10 x64, the VM manager is VirtualBox
Best regards / Pozdrawiam
paweld

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #55 on: March 03, 2024, 06:13:22 pm »
Hey Y'All,

Even when I run the generator on my HDD, we still have the same result:
Code: Bash  [Select][+][-]
  1. $ time ./bin/generator -i ./data/weather_stations.csv -o ./data/measurements-1_000_000_000.txt -n 1_000_000_000
  2. Input File : /home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv
  3. Output File: /home/gcarreno/Programming/1brc-ObjectPascal/data/measurements-1_000_000_000.txt
  4. Line Count : 1,000,000,000
  5.  
  6. Building Weather Stations...
  7. Done: Processed 44,691 entries from a total of 41,343 weather stations in 178 ms
  8.  
  9. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,892,043,902, elapsed: 2 min, 56 sec    
  10.  
  11. Done: file size: 16,892,043,902, elapsed: 2 min, 56 sec
  12.  
  13. real    2m59.505s
  14. user    2m39.890s
  15. sys    0m14.236s

We have completely eliminated file I/O and made it about memory manipulation !!!

Just BEE-YOU-TEE-FULL !!!

Again, @paweld, immense kudos !!

Cheers,
Gus

JuhaManninen

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 4695
  • I like bugs.
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #56 on: March 03, 2024, 06:16:06 pm »
I'm assuming that there are duplicates, and I'm sorting them so we have the same list for every run.
If I understood right, you set Sorted:= True; before adding to items.
As mentioned by Alpine and myself, adding items to a long sorted list is VERY slow.
I don't know why you ignore this obvious place for optimization.
Better is if you sort the list after adding all items. Even then it is slower than a good Map container.

Around 3 minutes is not very fast in today's computers. You can do better.
Python is known to be slow. It has other advantages, not speed.
« Last Edit: March 03, 2024, 06:18:15 pm by JuhaManninen »
Mostly Lazarus trunk and FPC 3.2 on Manjaro Linux 64-bit.

domasz

  • Hero Member
  • *****
  • Posts: 618
Re: [SOLVED] Why can't I beat Python on generating 1 Billion Rows?
« Reply #57 on: March 03, 2024, 06:26:23 pm »
My attempt below. In my tests similar paweld's.


Code: Pascal  [Select][+][-]
  1. program project1;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   {$IFDEF UNIX}
  7.   cthreads,
  8.   {$ENDIF}
  9.   Classes, SysUtils
  10.   { you can add units after this };
  11.  
  12. const Seed: Longint = 46668267;
  13.       HowMany = 100000000;    //100 mil, not 1 bil         <------------ change here
  14. var F: TFileStream;
  15.     start, stop, stop2: QWord;
  16.     Names: TStringList;
  17.     Body, Buf, Line: RawByteString;
  18.     A,B: Integer;
  19.     i: Integer;
  20.     Temp, Index: Integer;
  21.     Temp2: Integer;
  22.     Count: Integer;
  23.     k,pp: Integer;
  24.     Prog: Integer;
  25.     Step: Integer;
  26. begin
  27.   F := TFileStream.Create('weather_stations.csv', fmOpenRead);
  28.   start := GetTickCount64;
  29.  
  30.   Names := TStringList.Create;
  31.   Names.Duplicates := dupIgnore;
  32.   Names.Sorted := True;
  33.  
  34.   SetLength(Body, F.Size);
  35.   F.Read(Body[1], F.Size);
  36.  
  37.   F.Free;
  38.  
  39.   A := 0;
  40.   while 1=1 do begin
  41.     B := Pos(';', Body, A);
  42.     if B < A then break;
  43.  
  44.     Line := Copy(Body, A, B-A);
  45.     A := Pos(#10, Body, B)+1;
  46.  
  47.     Names.Add(Line);
  48.   end;
  49.  
  50.  
  51.   stop := GetTickCount64;
  52.  
  53.   WriteLn(Format('Done: Processed %d weather stations in %d ms', [
  54.             Names.Count,
  55.             stop-start
  56.           ]));
  57.   WriteLn;
  58.  
  59.   Count := Names.Count;
  60.  
  61.   RandSeed := Seed;
  62.   Prog := 0;
  63.   Buf := '';
  64.   k := 0;
  65.  
  66.   F := TFileStream.Create('result.txt', fmCreate);
  67.  
  68.   Step := HowMany div 10;
  69.  
  70.   pp := 0;
  71.   for i:=1 to HowMany do begin
  72.     Index := Random(Count);
  73.     Temp := Random(200)-100;
  74.     Temp2 := Random(9);
  75.  
  76.     Buf := Buf + Names[Index] + ';' + IntToStr(Temp) + '.' + IntToStr(Temp2) + #10;
  77.     Inc(k);
  78.  
  79.     if k > 5000 then begin
  80.       F.Write(Buf[1], Length(Buf));
  81.       Buf := '';
  82.       k := 0;
  83.     end;
  84.  
  85.     Inc(pp);
  86.     if pp > Step then begin
  87.       pp := 0;
  88.       Inc(Prog, 5);
  89.       Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
  90.     end;
  91.   end;
  92.  
  93.   Prog := 50;
  94.   Write(StringOfChar('#', Prog), ' - ', IntToStr(2*Prog), ' %', #13);
  95.   Writeln;
  96.  
  97.   F.Write(Buf[1], Length(Buf));
  98.   F.Free;
  99.  
  100.   Names.Free;
  101.  
  102.   stop2 := GetTickCount64;
  103.  
  104.  
  105.   WriteLn(Format('Total time: %f s', [    (stop2-start)/1000   ]));
  106.   WriteLn;
  107. end.
  108.  
« Last Edit: March 03, 2024, 06:36:59 pm by domasz »

Gustavo 'Gus' Carreno

  • Hero Member
  • *****
  • Posts: 1344
  • Professional amateur ;-P
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #58 on: March 03, 2024, 06:28:03 pm »
Hey JuhaManninen,

I'm assuming that there are duplicates, and I'm sorting them so we have the same list for every run.
If I understood right, you set Sorted:= True; before adding to items.
As mentioned by Alpine and myself, adding items to a long sorted list is VERY slow.
I don't know why you ignore this obvious place for optimization.
Better is if you sort the list after adding all items. Even then it is slower than a good Map container.

Around 3 minutes is not very fast in today's computers. You can do better.
Python is known to be slow. It has other advantages, not speed.

I'm not quite sure why you're still beating this drum.

I've already posted enough results that prove that I'm importing, sorting and de-duping ~45K weather stations in ~180 ms:
Code: Bash  [Select][+][-]
  1. $ time ./bin/generator -i ./data/weather_stations.csv -o ./data/measurements-1_000_000_000.txt -n 1_000_000_000
  2. Input File : /home/gcarreno/Programming/1brc-ObjectPascal/data/weather_stations.csv
  3. Output File: /home/gcarreno/Programming/1brc-ObjectPascal/data/measurements-1_000_000_000.txt
  4. Line Count : 1,000,000,000
  5.  
  6. Building Weather Stations...
  7. Done: Processed 44,691 entries from a total of 41,343 weather stations in 178 ms
  8.  
  9. [##################################################] 100.00 % lines: 1,000,000,000, file size: 16,892,043,902, elapsed: 2 min, 56 sec    
  10.  
  11. Done: file size: 16,892,043,902, elapsed: 2 min, 56 sec
  12.  
  13. real    2m59.505s
  14. user    2m39.890s
  15. sys     0m14.236s

The 3 minute mark is for producing 1 BILLION ROWS, not for the import, sorting and de-duping of the weather station list.
And, as amazing as it seems, I'm getting a consisting ~3m both from dumping into an SSD or an HDD!!

Can we, PLEASE, put this to rest?????

Cheers,
Gus
« Last Edit: March 03, 2024, 06:50:27 pm by Gustavo 'Gus' Carreno »

vfclists

  • Hero Member
  • *****
  • Posts: 1165
    • HowTos Considered Harmful?
Re: Why can't I beat Python on generating 1 Billion Rows?
« Reply #59 on: March 03, 2024, 06:50:37 pm »
Quote
real    2m59.505s
user    2m39.890s
sys     0m14.236s


Can anyone explain what real, sys, and user mean?

This seems to be one of those things which are not as obvious as they seem to be.
Lazarus 3.0/FPC 3.2.2

 

TinyPortal © 2005-2018