Recent

Author Topic: Sorting and Counting  (Read 7956 times)

Thaddy

  • Hero Member
  • *****
  • Posts: 9184
Re: Sorting and Counting
« Reply #30 on: July 18, 2019, 12:49:02 pm »
It may be a lot quicker with a memory mapped file approach. Especially on very long files. Then the readln is from memory most of the time, not from disk.
also related to equus asinus.

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 635
Re: Sorting and Counting
« Reply #31 on: July 18, 2019, 01:22:15 pm »
I made my own SaveToFile and LoadFromFile, which created a text file that was 4.5 GB in size, IIRC. It took 2 minutes to load that into memory. But less than a millisecond to query.

Btw, this was on a fast Linux sever with lots of memory and SSDs.
« Last Edit: July 18, 2019, 01:24:25 pm by SymbolicFrank »

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #32 on: July 18, 2019, 01:48:32 pm »
In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobby.
Check also this - https://en.wikipedia.org/wiki/External_sorting.
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

mangakissa

  • Hero Member
  • *****
  • Posts: 944
Re: Sorting and Counting
« Reply #33 on: July 18, 2019, 02:19:49 pm »
My sample with quicksort and a 30 mb file size takes 69 seconds WITHOUT ANY TSTRINGLIST.
Just old program developping.
Lazarus 1.84 (32b) / FPC 3.0.4
Windows 10

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 635
Re: Sorting and Counting
« Reply #34 on: July 18, 2019, 02:37:03 pm »
Storage in memory is also interesting. A linked list allows fast inserts, but is slow to access. An array has to be shifted. And indexes consist of arrays as well.

IIRC, I ended up with 100 separate arrays that all did a subset. That was fastest overall. And finding the right one was based on quicksort.

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #35 on: July 18, 2019, 02:49:45 pm »
I slightly changed the Akira1364's program:

Yeah, LGenerics is really good. The MultiSet is definitely an even more appropriate data structure than a hashmap (which TDictionary is) for this.

avk

  • Full Member
  • ***
  • Posts: 154
    • my self-education project
Re: Sorting and Counting
« Reply #36 on: July 18, 2019, 04:05:49 pm »
Yeah, LGenerics is really good...
Nice to hear kindly words from you. :)

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #37 on: July 18, 2019, 04:19:41 pm »
Nice to hear kindly words from you. :)

Oh, I didn't realize you were the author! Great job on it, again. It's by far the most advanced / feature-complete generics library ever written in Object Pascal.

On another note, I'm still extremely curious as to what "mpknap" is doing with his file currently that takes over an hour... is he using BogoSort or something? That's just crazily long.
« Last Edit: July 18, 2019, 04:24:59 pm by Akira1364 »

avk

  • Full Member
  • ***
  • Posts: 154
    • my self-education project
Re: Sorting and Counting
« Reply #38 on: July 18, 2019, 04:28:05 pm »
Thank you very much.
@mangakissa, can you submit full version of your solution?

mangakissa

  • Hero Member
  • *****
  • Posts: 944
Re: Sorting and Counting
« Reply #39 on: July 18, 2019, 04:52:20 pm »
the whole project  :D
Lazarus 1.84 (32b) / FPC 3.0.4
Windows 10

avk

  • Full Member
  • ***
  • Posts: 154
    • my self-education project
Re: Sorting and Counting
« Reply #40 on: July 18, 2019, 06:33:03 pm »
@julkas, @mangakissa
I changed your code a bit and pasted it into the Akira1364's app:
Code: Pascal  [Select]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, gmap;
  12.  
  13. type
  14.   TIntPair = TPair<LongInt, LongInt>;
  15.  
  16.   function ComparePairs(constref L, R: TIntPair): LongInt;
  17.   begin
  18.     if L.Key < R.Key then
  19.       Result := -1
  20.     else if L.Key = R.Key then
  21.       Result := 0
  22.     else
  23.       Result := 1;
  24.   end;
  25.  
  26. var
  27.   Total, Unique: Integer;
  28.   Start: TDateTime;
  29.  
  30.   procedure SortCountAkira;
  31.   var
  32.     I: LongInt;
  33.     InOut: Text;
  34.     Map: TDictionary<LongInt, LongInt>;
  35.     Pair: TIntPair;
  36.     Pairs: TArray<TIntPair>;
  37.   begin
  38.     Map := TDictionary<LongInt, LongInt>.Create();
  39.     Map.Capacity := 10000000;
  40.     Assign(InOut, ParamStr(1));
  41.     Reset(InOut);
  42.     while not EOF(InOut) do begin
  43.       ReadLn(InOut, I);
  44.       Inc(Total);
  45.       if not Map.ContainsKey(I) then
  46.         begin
  47.           Map.Add(I, 1);
  48.           Inc(Unique);
  49.         end
  50.       else
  51.         Map[I] := Map[I] + 1;
  52.     end;
  53.     Close(InOut);
  54.     Pairs := Map.ToArray();
  55.     Map.Free();
  56.     TArrayHelper<TIntPair>.Sort(
  57.       Pairs,
  58.       TComparer<TIntPair>.Construct(ComparePairs)
  59.     );
  60.     Assign(InOut, ParamStr(2));
  61.     Rewrite(InOut);
  62.     for Pair in Pairs do with Pair do
  63.       WriteLn(InOut, Key, ' - ', Value);
  64.     Close(InOut);
  65.   end;
  66.  
  67.   procedure SortCountHoward;
  68.   var
  69.     arr: array of Integer;
  70.     textf: TextFile;
  71.     min: Integer = High(Integer);
  72.     max: Integer = -1;
  73.     i: Integer;
  74.   begin
  75.     AssignFile(textf, ParamStr(1));
  76.     Reset(textf);
  77.     while not EOF(textf) do
  78.       begin
  79.         ReadLn(textf, i);
  80.         Inc(Total);
  81.         if i < min then
  82.           min := i;
  83.         if i > max then
  84.           max := i;
  85.       end;
  86.     SetLength(arr, max-min+1);
  87.  
  88.     Reset(textf);
  89.     while not EOF(textf) do
  90.       begin
  91.         ReadLn(textf, i);
  92.         Dec(i, min);
  93.         Inc(arr[i]);
  94.       end;
  95.     CloseFile(textf);
  96.  
  97.     AssignFile(textf, ParamStr(2));
  98.     Rewrite(textf);
  99.     for i := Low(arr) to High(arr) do
  100.       case (arr[i] > 0) of
  101.         True:
  102.           begin
  103.             WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  104.             Inc(Unique);
  105.           end;
  106.       end;
  107.     CloseFile(textf);
  108.     SetLength(arr, 0);
  109.   end;
  110.  
  111.   procedure SortCountAvk1;
  112.   type
  113.     TCounter  = TGHashMultiSetLP<Integer>;
  114.     TCountRef = TGAutoRef<TCounter>;
  115.     TEntry    = TCounter.TEntry;
  116.  
  117.     function EntryCmp(constref L, R: TEntry): SizeInt;
  118.     begin
  119.       if L.Key > R.Key then
  120.         Result := 1
  121.       else
  122.         if L.Key < R.Key then
  123.           Result := -1
  124.         else
  125.           Result := 0;
  126.     end;
  127.  
  128.   var
  129.     CountRef: TCountRef;
  130.     InOut: Text;
  131.     Counter: TCounter;
  132.     e: TEntry;
  133.     I: Integer;
  134.   begin
  135.     Counter := CountRef;
  136.     Counter.LoadFactor := 0.7;
  137.     Assign(InOut, ParamStr(1));
  138.     Reset(InOut);
  139.     while not EOF(InOut) do
  140.       begin
  141.         ReadLn(InOut, I);
  142.         Counter.Add(I);
  143.       end;
  144.     Close(InOut);
  145.     Total := Counter.Count;
  146.     Unique := Counter.EntryCount;
  147.     if Counter.NonEmpty then
  148.       begin
  149.         Assign(InOut, ParamStr(2));
  150.         Rewrite(InOut);
  151.         for e in Counter.Entries.Sorted(EntryCmp) do
  152.           with e do
  153.             WriteLn(InOut, Key, ' - ', Count);
  154.         Close(InOut);
  155.       end;
  156.   end;
  157.  
  158.   procedure SortCountAvk2;
  159.   var
  160.     List: array of Integer;
  161.     InOut: Text;
  162.     I, J, Count, DupCount: Integer;
  163.   begin
  164.     Assign(InOut, ParamStr(1));
  165.     Reset(InOut);
  166.     SetLength(List, 4096);
  167.     I := 0;
  168.     while not EOF(InOut) do
  169.       begin
  170.         ReadLn(InOut, J);
  171.         Inc(Total);
  172.         if Length(List) = I then
  173.           SetLength(List, I * 2);
  174.         List[I] := J;
  175.         Inc(I);
  176.       end;
  177.     Close(InOut);
  178.     SetLength(List, I);
  179.     if List = nil then
  180.       exit;
  181.     TGOrdinalArrayHelper<Integer>.Sort(List);
  182.     Count := I;
  183.     DupCount := 0;
  184.     I := 0;
  185.     Assign(InOut, ParamStr(2));
  186.     Rewrite(InOut);
  187.     repeat
  188.       J := List[I];
  189.       while (I < Count) and (List[I] = J) do
  190.         begin
  191.           Inc(DupCount);
  192.           Inc(I);
  193.         end;
  194.       WriteLn(InOut, J, ' - ', DupCount);
  195.       Inc(Unique);
  196.       DupCount := 0;
  197.     until I = Count;
  198.     Close(InOut);
  199.   end;
  200.  
  201.   procedure SortCountJulkas;
  202.   type
  203.     TIntLess = TLess<LongInt>;
  204.     TDict = TMap<LongInt, LongInt, TIntLess>;
  205.   var
  206.     sc: TDict;
  207.     scit: TDict.TIterator;
  208.     InOut: Text;
  209.     key, cnt: LongInt;
  210.   begin
  211.     sc := TDict.Create;
  212.     Assign(InOut, ParamStr(1));
  213.     Reset(InOut);
  214.     while not EOF(InOut) do
  215.       begin
  216.         ReadLn(InOut, key);
  217.         Inc(Total);
  218.         cnt := 0;
  219.         sc.TryGetValue(key, cnt);
  220.         sc[key] := cnt + 1;
  221.       end;
  222.     Close(InOut);
  223.     Unique := sc.Size;
  224.     if Unique > 0 then
  225.       begin
  226.         Assign(InOut, ParamStr(2));
  227.         Rewrite(InOut);
  228.         scit := sc.Min;
  229.         repeat
  230.           WriteLn(InOut, scit.Key, ' - ', scit.Value);
  231.         until not scit.Next;
  232.         Close(InOut);
  233.         scit.Free;
  234.       end;
  235.     sc.Free;
  236.   end;
  237.  
  238.   procedure SortCountMangakissa;
  239.   type
  240.     TMyTime = packed record
  241.       Unixtime : Integer;
  242.       Counter  : word;
  243.     end;
  244.  
  245.     function Find(var aMyTime : array of TMytime; aLine : Integer) : boolean;
  246.     var index : integer;
  247.     begin
  248.       result := false;
  249.       if length(aMyTime) > 0 then
  250.       begin
  251.         for index := low(aMyTime) to high(aMyTime) do
  252.         begin
  253.           if aMyTime[index].Unixtime = aLine then
  254.           begin
  255.             aMyTime[index].Counter := aMyTime[index].Counter + 1;
  256.             result := true;
  257.             break;
  258.           end;
  259.         end;
  260.       end;
  261.     end;
  262.  
  263.     procedure QuickSort(var A: array of tMytime; iLo, iHi: Integer) ;
  264.      var
  265.        Lo, Hi, Pivot : Integer;
  266.        T             : TMyTime;
  267.      begin
  268.        Lo := iLo;
  269.        Hi := iHi;
  270.        Pivot := A[(Lo + Hi) div 2].Unixtime;
  271.        repeat
  272.          while A[Lo].Unixtime < Pivot do Inc(Lo) ;
  273.          while A[Hi].Unixtime > Pivot do Dec(Hi) ;
  274.          if Lo <= Hi then
  275.          begin
  276.            T := A[Lo];
  277.            A[Lo] := A[Hi];
  278.            A[Hi] := T;
  279.            Inc(Lo) ;
  280.            Dec(Hi) ;
  281.          end;
  282.        until Lo > Hi;
  283.        if Hi > iLo then QuickSort(A, iLo, Hi) ;
  284.        if Lo < iHi then QuickSort(A, Lo, iHi) ;
  285.      end;
  286.   var
  287.     InOut         : Text;
  288.     MyTime        : array of TMyTime = nil;
  289.     Item          : TMyTime;
  290.     myline, index : integer;
  291.   begin
  292.     Assign(InOut, ParamStr(1));
  293.     Reset(InOut);
  294.     index := 0;
  295.     while not EOF(InOut) do
  296.       begin
  297.         ReadLn(InOut, myline);
  298.         Inc(Total);
  299.         if not find(MyTime, myline) then
  300.         begin
  301.           index := index + 1;
  302.           setlength(MyTime,index);
  303.           MyTime[index - 1].Unixtime := myLine;
  304.           MyTime[index - 1].Counter := 1;
  305.         end;
  306.       end;
  307.     Close(InOut);
  308.     if MyTime = nil then
  309.       exit;
  310.     Unique := Length(MyTime);
  311.     QuickSort(Mytime, low(MyTime), high(mytime));
  312.     Assign(InOut, ParamStr(2));
  313.     Rewrite(InOut);
  314.     for Item in Mytime do
  315.       with Item do
  316.         WriteLn(InOut, Unixtime, ' - ', Counter);
  317.     Close(InOut);
  318.   end;
  319.  
  320.   procedure Run(aProc: TProcedure);
  321.   begin
  322.     Total := 0;
  323.     Unique := 0;
  324.     Start := Now;
  325.     try
  326.       aProc();
  327.       WriteLn('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  328.       WriteLn('#total: ', Total, ', #unique: ', Unique);
  329.     except
  330.       on e: Exception do
  331.         WriteLn('crashes with message "', e.Message, '"');
  332.     end;
  333.   end;
  334.  
  335. begin
  336.   if ParamCount <> 2 then
  337.     begin
  338.       WriteLn('Usage: OccurrenceCounter infilename outfilename');
  339.       exit;
  340.     end;
  341.   if not FileExists(ParamStr(1)) then
  342.     begin
  343.       WriteLn('Input file "', ParamStr(1), '" not found');
  344.       exit;
  345.     end;
  346.  
  347.   WriteLn('running SortCountAkira:');
  348.   Run(SortCountAkira);
  349.   WriteLn;
  350.  
  351.   WriteLn('running SortCountHoward:');
  352.   Run(@SortCountHoward);
  353.   WriteLn;
  354.  
  355.  
  356.   WriteLn('running SortCountAvk1:');
  357.   Run(SortCountAvk1);
  358.   WriteLn;
  359.  
  360.   WriteLn('running SortCountAvk2:');
  361.   Run(SortCountAvk2);
  362.   WriteLn;
  363.  
  364.   WriteLn('running SortCountJulkas:');
  365.   Run(SortCountJulkas);
  366.   WriteLn;
  367.  
  368.   WriteLn('running SortCountMangakissa:');
  369.   Run(SortCountMangakissa);
  370. end.
  371.  
Results for 20MB file:
Code: Text  [Select]
  1. running SortCountAkira:
  2. elapsed time: 1.6700
  3. #total: 1750000, #unique: 710248
  4.  
  5. running SortCountHoward:
  6. elapsed time: 1.5500
  7. #total: 1750000, #unique: 710248
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 1.0100
  11. #total: 1750000, #unique: 710248
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 0.7200
  15. #total: 1750000, #unique: 710248
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 2.6600
  19. #total: 1750000, #unique: 710248
  20.  
  21. running SortCountMangakissa:
  22. elapsed time: 346.6510
  23. #total: 1750000, #unique: 710248
  24.  
  25.  

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #41 on: July 18, 2019, 07:44:00 pm »
@Avk:

What do you get if you change the

Code: Pascal  [Select]
  1. Map.Capacity := 10000000;

in mine to something closer to the expected total for the 20MB file (as opposed to the 100+ MB one.)

Like:

Code: Pascal  [Select]
  1. Map.Capacity := 1750000;

or whatever. Or even if you just comment out that line completely, as for that much smaller expected input it may not be worth it at all over how TDictionary pre-allocates internally already.

Also FWIW, I'm quite sure that for Mangakissa's version, it's this part:

Code: Pascal  [Select]
  1. index := index + 1;
  2. setlength(MyTime,index);

that hurts the performance.
« Last Edit: July 18, 2019, 07:45:46 pm by Akira1364 »

julkas

  • Sr. Member
  • ****
  • Posts: 416
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #42 on: July 18, 2019, 07:44:44 pm »
@avk thanks for benchmark.
It's old but very good problem. You can learn something new - sorting algorithm, hashing, counting, ...
My approach is not fastest, but clean and short. I use only one class from FPC 3.0.4. May be tomorrow I will post faster variant.
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

440bx

  • Hero Member
  • *****
  • Posts: 1199
Re: Sorting and Counting
« Reply #43 on: July 18, 2019, 07:56:39 pm »
@avk

did you use Akira's InputGenerator program to generate the input file ?

One reason I'm asking is because the code I copied from Akira's post generates a 120MB file.  Did you use a modified version ?  If so, could you also post whatever program you used to generate the input file ?

Thank you.


using FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #44 on: July 18, 2019, 08:12:57 pm »
did you use Akira's InputGenerator program to generate the input file ?

I imagine he used the original "20MB" code of his from an earlier comment for that one.

If you change my generator program to look like this:

Code: Pascal  [Select]
  1. program InputGenerator;
  2.  
  3. var
  4.   InFile: Text;
  5.   I: LongInt;
  6.  
  7. begin
  8.   Randomize();
  9.   Assign(InFile, 'infile.txt');
  10.   Rewrite(InFile);
  11.   for I := 1 to 1750000 do
  12.     WriteLn(InFile, 1500000000 + Random(800000));
  13.   Close(InFile);
  14. end.

It will create the smaller 20MB file (with 1750000 total entries, versus the 10000000 total entries that the 100+ MB file has.)