Recent

Author Topic: Sorting and Counting  (Read 8412 times)

julkas

  • Sr. Member
  • ****
  • Posts: 430
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #15 on: July 17, 2019, 06:07:38 pm »
gmap might depend on the number of unique values. If that gets large it might also slow down.
About ~ 10000000 random longint values.
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #16 on: July 17, 2019, 06:40:45 pm »
@julkas, to generate a test file, I used this:
Code: Pascal  [Select]
  1. procedure CreateTestFile(const aFileName: string);
  2. var
  3.   I: Integer;
  4. const
  5.   TestSize = 1750000;
  6. begin
  7.   with TStringList.Create do
  8.     try
  9.       for I := 1 to TestSize do
  10.         Add(IntToStr(1500000000 + Random(800000)));
  11.       SaveToFile(aFileName);
  12.     finally
  13.       Free;
  14.     end;
  15. end;
  16.  
The output is a ~20MB text file, containing ~700000 unique values.

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #17 on: July 17, 2019, 08:42:08 pm »
Here's a example / benchmark of a version that uses Generics.Collections (available by default in trunk FPC, but also compatible with 3.0.4 if you just copy the sources to somewhere in whatever your unit search path is.)

Performance seems to be quite good (runs in around 0.8 - 1.1 seconds on average for me.)

Code: Pascal  [Select]
  1. program Example;
  2.  
  3. {$mode Delphi}
  4.  
  5. uses
  6.   SysUtils, DateUtils,
  7.   Generics.Defaults, Generics.Collections;
  8.  
  9. type
  10.   TIntPair = TPair<LongInt, LongInt>;
  11.   TIntMap = TDictionary<LongInt, LongInt>;
  12.  
  13.   function ComparePairs(constref L, R: TIntPair): LongInt;
  14.   begin
  15.     if L.Key < R.Key then
  16.       Result := -1
  17.     else if L.Key = R.Key then
  18.       Result := 0
  19.     else
  20.       Result := 1;
  21.   end;
  22.  
  23. var
  24.   I: LongInt;
  25.   Start: TDateTime;
  26.   InFile, OutFile: Text;
  27.   Map: TIntMap;
  28.   Pair: TIntPair;
  29.   Pairs: TArray<TIntPair>;
  30.  
  31. begin
  32.   // Generate a random test file first, with an adaptation of avk's method.
  33.   Randomize();
  34.   Assign(InFile, 'data.txt');
  35.   Rewrite(InFile);
  36.   for I := 1 to 1750000 do
  37.     WriteLn(InFile, 1500000000 + Random(800000));
  38.   Close(InFile);
  39.   I := 0;
  40.   // Start the timer only now, as this is where the real work starts.
  41.   Start := Now();
  42.   Map := TIntMap.Create();
  43.   // Allocate a big chunk of memory here in advance, for performance's sake.
  44.   // Doesn't matter if it's more than you end up needing, as Capacity is separate from Count.
  45.   Map.Capacity := 1750000;
  46.   Assign(InFile, 'data.txt');
  47.   Reset(InFile);
  48.   while not EOF(InFile) do begin
  49.     ReadLn(InFile, I);
  50.     if not Map.ContainsKey(I) then
  51.       Map.Add(I, 1)
  52.     else
  53.       Map[I] := Map[I] + 1;
  54.   end;
  55.   Close(InFile);
  56.   Pairs := Map.ToArray();
  57.   TArrayHelper<TIntPair>.Sort(
  58.     Pairs,
  59.     TComparer<TIntPair>.Construct(ComparePairs)
  60.   );
  61.   Assign(OutFile, 'output.txt');
  62.   Rewrite(OutFile);
  63.   for Pair in Pairs do with Pair do
  64.     WriteLn(OutFile, Key, ' - ', Value);
  65.   Close(OutFile);
  66.   Map.Free();
  67.   WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  68. end.
« Last Edit: July 18, 2019, 12:18:05 am by Akira1364 »

howardpc

  • Hero Member
  • *****
  • Posts: 3199
Re: Sorting and Counting
« Reply #18 on: July 17, 2019, 08:47:16 pm »
Unfortunately, it’s not only a matter of resources: on a dataset of the size specified above, your solution is 4 times(90 s.) slower than engkin's  one(22 s.)  and it is prohibitively slow for a task of this size.
I have changed my mind.

A generic container like TStringList is not well suited to this task on large files.
Here follows a simple dynamic array solution. The following routine executes in less than 2s on my average 5-year-old machine, using a test file generated from your code.
Code: Pascal  [Select]
  1. program SortCountlpi;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   Classes, sysutils;
  7.  
  8. procedure CreateTestFile(const aFileName: string);
  9. var
  10.   I: Integer;
  11. const
  12.   TestSize = 1750000;
  13. begin
  14.   with TStringList.Create do
  15.     try
  16.       for I := 1 to TestSize do
  17.         Add(IntToStr(1500000000 + Random(800000)));
  18.       SaveToFile(aFileName);
  19.     finally
  20.       Free;
  21.     end;
  22. end;
  23.  
  24. procedure SortCountViaArray(const anInfile, anOutFile: String);
  25. var
  26.   arr: array of Integer;
  27.   textf: TextFile;
  28.   min: Integer = High(Integer);
  29.   max: Integer = -1;
  30.   i: Integer;
  31. begin
  32.   if FileExists(anInfile) then begin
  33.     AssignFile(textf, anInfile);
  34.     Reset(textf);
  35.     while not EOF(textf) do
  36.       begin
  37.         ReadLn(textf, i);
  38.         if i < min then
  39.           min := i;
  40.         if i > max then
  41.           max := i;
  42.       end;
  43.     SetLength(arr, max-min+1);
  44.  
  45.     Reset(textf);
  46.     while not EOF(textf) do
  47.       begin
  48.         ReadLn(textf, i);
  49.         Dec(i, min);
  50.         Inc(arr[i]);
  51.       end;
  52.     CloseFile(textf);
  53.  
  54.     AssignFile(textf, anOutFile);
  55.     Rewrite(textf);
  56.     for i := Low(arr) to High(arr) do
  57.       case (arr[i] > 0) of
  58.         True: WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  59.       end;
  60.     CloseFile(textf);
  61.     SetLength(arr, 0);
  62.   end;
  63. end;
  64.  
  65. var
  66.   t: TDateTime;
  67.  
  68. begin
  69.   CreateTestFile('infile.txt');
  70.   t := Time;
  71.   SortCountViaArray('infile.txt', 'outfile.txt');
  72.   Writeln('Elapsed time: ',DateTimeToTimeStamp(Time -t).Time,' ms');
  73. end.
« Last Edit: July 17, 2019, 08:58:41 pm by howardpc »

mpknap

  • Jr. Member
  • **
  • Posts: 92
Re: Sorting and Counting
« Reply #19 on: July 17, 2019, 09:28:26 pm »
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user only for me. It is important that the result is correct.

:)

jamie

  • Hero Member
  • *****
  • Posts: 2141
Re: Sorting and Counting
« Reply #20 on: July 17, 2019, 11:13:25 pm »
If you want faster you need to create a single chuck of memory that will house all of the file at once..

Load the file into that memory and write some code that shuffles the lines around. The idea is to not allow the memory manager to keep fragmentating the operation.

 Also putting it in a more simpler way, your fields are all the same size which makes this easy and since they are numbers only they can be converted into a Binary number and placed in a array to suites your needs..

  Once in the array you sort the array with simple algorithms.
 
 So step one is the read the file in line by line, convert each entry to a binary number and then store it in the
array..

 Once you have this data stored you then sort the array using something like a bubble sort.

 The array size can be calculated a head of time so you can dynamically create it..

 ArraySize := FileSize Div_NUmber_Chars_PerEntry_Plus2_For CRLF;

 So your array could be this
Array of Int64;
SetLength(MyArray, ArraySize);

etc.
the rest is just code...

Number 1 at blue screen app creations!

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #21 on: July 18, 2019, 12:30:37 am »
A generic container like TStringList is not well suited to this task on large files.

Well, an actually-generic hashmap that deals directly with the relevant type as opposed to unnecessarily stringifying the values is definitely well-suited to it. Your array solution is certainly quite fast though.

Here's both of ours wrapped up into a single command-line app with an option to tell it which to use:

Code: Pascal  [Select]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5.  
  6. uses
  7.   SysUtils, DateUtils,
  8.   Generics.Defaults, Generics.Collections;
  9.  
  10. type
  11.   TIntPair = TPair<LongInt, LongInt>;
  12.  
  13.   function ComparePairs(constref L, R: TIntPair): LongInt;
  14.   begin
  15.     if L.Key < R.Key then
  16.       Result := -1
  17.     else if L.Key = R.Key then
  18.       Result := 0
  19.     else
  20.       Result := 1;
  21.   end;
  22.  
  23.   procedure SortCountAkira;
  24.   var
  25.     I: LongInt;
  26.     InOut: Text;
  27.     Map: TDictionary<LongInt, LongInt>;
  28.     Pair: TIntPair;
  29.     Pairs: TArray<TIntPair>;
  30.   begin
  31.     if FileExists(ParamStr(2)) then begin
  32.       Map := TDictionary<LongInt, LongInt>.Create();
  33.       Map.Capacity := 10000000;
  34.       Assign(InOut, ParamStr(2));
  35.       Reset(InOut);
  36.       while not EOF(InOut) do begin
  37.         ReadLn(InOut, I);
  38.         if not Map.ContainsKey(I) then
  39.           Map.Add(I, 1)
  40.         else
  41.           Map[I] := Map[I] + 1;
  42.       end;
  43.       Close(InOut);
  44.       Pairs := Map.ToArray();
  45.       Map.Free();
  46.       TArrayHelper<TIntPair>.Sort(
  47.         Pairs,
  48.         TComparer<TIntPair>.Construct(ComparePairs)
  49.       );
  50.       Assign(InOut, ParamStr(3));
  51.       Rewrite(InOut);
  52.       for Pair in Pairs do with Pair do
  53.         WriteLn(InOut, Key, ' - ', Value);
  54.       Close(InOut);
  55.     end;
  56.   end;
  57.  
  58.   procedure SortCountHoward;
  59.   var
  60.     arr: array of Integer;
  61.     textf: TextFile;
  62.     min: Integer = High(Integer);
  63.     max: Integer = -1;
  64.     i: Integer;
  65.   begin
  66.     if FileExists(ParamStr(2)) then begin
  67.       AssignFile(textf, ParamStr(2));
  68.       Reset(textf);
  69.       while not EOF(textf) do
  70.         begin
  71.           ReadLn(textf, i);
  72.           if i < min then
  73.             min := i;
  74.           if i > max then
  75.             max := i;
  76.         end;
  77.       SetLength(arr, max-min+1);
  78.  
  79.       Reset(textf);
  80.       while not EOF(textf) do
  81.         begin
  82.           ReadLn(textf, i);
  83.           Dec(i, min);
  84.           Inc(arr[i]);
  85.         end;
  86.       CloseFile(textf);
  87.  
  88.       AssignFile(textf, ParamStr(3));
  89.       Rewrite(textf);
  90.       for i := Low(arr) to High(arr) do
  91.         case (arr[i] > 0) of
  92.           True: WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  93.         end;
  94.       CloseFile(textf);
  95.       SetLength(arr, 0);
  96.     end;
  97.   end;
  98.  
  99. var
  100.   Start: TDateTime;
  101.  
  102. begin
  103.   if ParamCount() <> 3 then
  104.     WriteLn('Usage: occurrencecounter (-akira | -howard) infilename outfilename')
  105.   else if ParamStr(1) = '-akira' then
  106.   begin
  107.     Start := Now();
  108.     SortCountAkira();
  109.     WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  110.   end
  111.   else if ParamStr(1) = '-howard' then
  112.   begin
  113.     Start := Now();
  114.     SortCountHoward();
  115.     WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  116.   end
  117.   else
  118.     WriteLn('Usage: occurrencecounter (-akira | -howard) infilename outfilename');  
  119. end.

Also a generator program for the input file (which gives something a little more heavyweight than what people have been using so far):

Code: Pascal  [Select]
  1. program InputGenerator;
  2.  
  3. var
  4.   InFile: Text;
  5.   I: LongInt;
  6.  
  7. begin
  8.   Randomize();
  9.   Assign(InFile, 'infile.txt');
  10.   Rewrite(InFile);
  11.   for I := 1 to 10000000 do
  12.     WriteLn(InFile, 1500000000 + Random(800000));
  13.   Close(InFile);
  14. end.

The TXT file is large, sorting takes more than an hour.

Wait, really? That's thousands and thousands of times too long if it's only 20MB or so.
« Last Edit: July 18, 2019, 04:16:25 am by Akira1364 »

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #22 on: July 18, 2019, 03:41:54 am »
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.

Akira1364

  • Hero Member
  • *****
  • Posts: 539
Re: Sorting and Counting
« Reply #23 on: July 18, 2019, 04:13:47 am »
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.

I was kind of going for testing the "worst case scenario", but it does seem a bit too worse case now that you point it out. I just edited it a bit in the comment.
« Last Edit: July 18, 2019, 04:17:31 am by Akira1364 »

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #24 on: July 18, 2019, 07:43:50 am »
I slightly changed the Akira1364's program:
Code: Pascal  [Select]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5.  
  6. uses
  7.   SysUtils, DateUtils,
  8.   Generics.Defaults, Generics.Collections,
  9.   LGUtils, , LGAbstractContainer, LGHashMultiSet, LGArrayHelpers;
  10.  
  11. type
  12.   TIntPair = TPair<LongInt, LongInt>;
  13.  
  14.   function ComparePairs(constref L, R: TIntPair): LongInt;
  15.   begin
  16.     if L.Key < R.Key then
  17.       Result := -1
  18.     else if L.Key = R.Key then
  19.       Result := 0
  20.     else
  21.       Result := 1;
  22.   end;
  23.  
  24. var
  25.   Total, Unique: Integer;
  26.   Start: TDateTime;
  27.  
  28.   procedure SortCountAkira;
  29.   var
  30.     I: LongInt;
  31.     InOut: Text;
  32.     Map: TDictionary<LongInt, LongInt>;
  33.     Pair: TIntPair;
  34.     Pairs: TArray<TIntPair>;
  35.   begin
  36.     Map := TDictionary<LongInt, LongInt>.Create();
  37.     Map.Capacity := 10000000;
  38.     Assign(InOut, ParamStr(1));
  39.     Reset(InOut);
  40.     while not EOF(InOut) do begin
  41.       ReadLn(InOut, I);
  42.       Inc(Total);
  43.       if not Map.ContainsKey(I) then
  44.         begin
  45.           Map.Add(I, 1);
  46.           Inc(Unique);
  47.         end
  48.       else
  49.         Map[I] := Map[I] + 1;
  50.     end;
  51.     Close(InOut);
  52.     Pairs := Map.ToArray();
  53.     Map.Free();
  54.     TArrayHelper<TIntPair>.Sort(
  55.       Pairs,
  56.       TComparer<TIntPair>.Construct(ComparePairs)
  57.     );
  58.     Assign(InOut, ParamStr(2));
  59.     Rewrite(InOut);
  60.     for Pair in Pairs do with Pair do
  61.       WriteLn(InOut, Key, ' - ', Value);
  62.     Close(InOut);
  63.   end;
  64.  
  65.   procedure SortCountHoward;
  66.   var
  67.     arr: array of Integer;
  68.     textf: TextFile;
  69.     min: Integer = High(Integer);
  70.     max: Integer = -1;
  71.     i: Integer;
  72.   begin
  73.     AssignFile(textf, ParamStr(1));
  74.     Reset(textf);
  75.     while not EOF(textf) do
  76.       begin
  77.         ReadLn(textf, i);
  78.         Inc(Total);
  79.         if i < min then
  80.           min := i;
  81.         if i > max then
  82.           max := i;
  83.       end;
  84.     SetLength(arr, max-min+1);
  85.  
  86.     Reset(textf);
  87.     while not EOF(textf) do
  88.       begin
  89.         ReadLn(textf, i);
  90.         Dec(i, min);
  91.         Inc(arr[i]);
  92.       end;
  93.     CloseFile(textf);
  94.  
  95.     AssignFile(textf, ParamStr(2));
  96.     Rewrite(textf);
  97.     for i := Low(arr) to High(arr) do
  98.       case (arr[i] > 0) of
  99.         True:
  100.           begin
  101.             WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  102.             Inc(Unique);
  103.           end;
  104.       end;
  105.     CloseFile(textf);
  106.     SetLength(arr, 0);
  107.   end;
  108.  
  109.   function EntryCmp(constref L, R: TGMultisetEntry<Integer>): SizeInt;
  110.   begin
  111.     if L.Key > R.Key then
  112.       Result := 1
  113.     else
  114.       if L.Key < R.Key then
  115.         Result := -1
  116.       else
  117.         Result := 0;
  118.   end;  
  119.  
  120.   procedure SortCountAvk1;
  121.   type
  122.     TCounter  = TGHashMultiSetLP<Integer>;
  123.     TCountRef = TGAutoRef<TCounter>;
  124.     TEntry    = TCounter.TEntry;
  125.   var
  126.     CountRef: TCountRef;
  127.     InOut: Text;
  128.     Counter: TCounter;
  129.     e: TEntry;
  130.     I: Integer;
  131.   begin
  132.     Counter := CountRef;
  133.     Assign(InOut, ParamStr(1));
  134.     Reset(InOut);
  135.     while not EOF(InOut) do
  136.       begin
  137.         ReadLn(InOut, I);
  138.         Counter.Add(I);
  139.       end;
  140.     Close(InOut);
  141.     Total := Counter.Count;
  142.     Unique := Counter.EntryCount;
  143.     if Counter.NonEmpty then
  144.       begin
  145.         Assign(InOut, ParamStr(2));
  146.         Rewrite(InOut);
  147.         for e in Counter.Entries.Sorted(@EntryCmp) do
  148.           with e do
  149.              WriteLn(InOut, Key, ' - ', Count);
  150.         Close(InOut);
  151.       end;
  152.   end;
  153.  
  154.   procedure SortCountAvk2;
  155.   var
  156.     List: array of Integer;
  157.     InOut: Text;
  158.     I, J, Count, DupCount: Integer;
  159.   begin
  160.     Assign(InOut, ParamStr(1));
  161.     Reset(InOut);
  162.     SetLength(List, 4096);
  163.     I := 0;
  164.     while not EOF(InOut) do
  165.       begin
  166.         ReadLn(InOut, J);
  167.         Inc(Total);
  168.         if Length(List) = I then
  169.           SetLength(List, I * 2);
  170.         List[I] := J;
  171.         Inc(I);
  172.       end;
  173.     Close(InOut);
  174.     SetLength(List, I);
  175.     if List = nil then
  176.       exit;
  177.     TGOrdinalArrayHelper<Integer>.Sort(List);
  178.     Count := I;
  179.     DupCount := 0;
  180.     I := 0;
  181.     Assign(InOut, ParamStr(2));
  182.     Rewrite(InOut);
  183.     repeat
  184.       J := List[I];
  185.       while (I < Count) and (List[I] = J) do
  186.         begin
  187.           Inc(DupCount);
  188.           Inc(I);
  189.         end;
  190.       WriteLn(InOut, J, ' - ', DupCount);
  191.       Inc(Unique);
  192.       DupCount := 0;
  193.     until I = Count;
  194.     Close(InOut);
  195.   end;
  196.  
  197.   procedure Run(aProc: TProcedure);
  198.   begin
  199.     Total := 0;
  200.     Unique := 0;
  201.     Start := Now;
  202.     try
  203.       aProc();
  204.       WriteLn('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  205.       WriteLn('total values: ', Total, ', unique values: ', Unique);
  206.     except
  207.       on e: Exception do
  208.         WriteLn('crashes with message "', e.Message, '"');
  209.     end;
  210.   end;
  211.  
  212. begin
  213.   if ParamCount <> 2 then
  214.     begin
  215.       WriteLn('Usage: occurrencecounter infilename outfilename');
  216.       exit;
  217.     end;
  218.   if not FileExists(ParamStr(1)) then
  219.     begin
  220.       WriteLn('Input file "', ParamStr(1), '" not found');
  221.       exit;
  222.     end;
  223.  
  224.   WriteLn('running SortCountAkira:');
  225.   Run(@SortCountAkira);
  226.   WriteLn;
  227.  
  228.   WriteLn('running SortCountHoward:');
  229.   Run(@SortCountHoward);
  230.   WriteLn;
  231.  
  232.  
  233.   WriteLn('running SortCountAvk1:');
  234.   Run(@SortCountAvk1);
  235.   WriteLn;
  236.  
  237.   WriteLn('running SortCountAvk2:');
  238.   Run(@SortCountAvk2);
  239. end.
  240.  
using current version of his InputGenerator output is(32-bit compiler):
Code: Text  [Select]
  1. running SortCountAkira:
  2. elapsed time: 4.7730
  3. total values: 10000000, unique values: 799999
  4.  
  5. running SortCountHoward:
  6. elapsed time: 6.0220
  7. total values: 10000000, unique values: 799999
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 3.7910
  11. total values: 10000000, unique values: 799999
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 2.9480
  15. total values: 10000000, unique values: 799999
  16.  
using previous version:
Code: Text  [Select]
  1. running SortCountAkira:
  2. elapsed time: 14.5700
  3. total values: 10000000, unique values: 9976599
  4.  
  5. running SortCountHoward:
  6. crashes with message "Invalid pointer operation"
  7.  
  8. running SortCountAvk1:
  9. elapsed time: 8.1750
  10. total values: 10000000, unique values: 9976599
  11.  
  12. running SortCountAvk2:
  13. elapsed time: 6.6450
  14. total values: 10000000, unique values: 9976599
  15.  
« Last Edit: July 18, 2019, 11:34:51 am by avk »

Thaddy

  • Hero Member
  • *****
  • Posts: 9278
Re: Sorting and Counting
« Reply #25 on: July 18, 2019, 08:37:51 am »
@ AVK
You can optimize a little with {$I-}{$H-} or did you already do that? {$I-} speeds up writeln consideably.
Also I wouldn't use Randomize() but use a fixed seed, so the file becomes reproducable.
« Last Edit: July 18, 2019, 08:48:36 am by Thaddy »
also related to equus asinus.

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #26 on: July 18, 2019, 09:25:34 am »
@Thaddy, I think it does not make much sense. The idea was to show that for this task there are ways to solve without shamanic dances with a tambourine and in a reasonable time.

julkas

  • Sr. Member
  • ****
  • Posts: 430
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #27 on: July 18, 2019, 10:00:23 am »
GMap random test. Out file size ~ 146 MB
Code: Pascal  [Select]
  1. program sc;
  2. {$mode delphi}
  3. uses gmap, gutil, SysUtils;
  4. const
  5.   keyNum = 10000000;
  6. type
  7.   TIntLess = TLess<LongInt>;
  8.   TDict = TMap<LongInt, LongInt, TIntLess>;
  9. var
  10.   sc: TDict;
  11.   scit: TDict.TIterator;
  12.   i: LongInt;
  13.   key, cnt: LongInt;
  14.   start: QWord;
  15.   outFile: Text;
  16. begin
  17.   start := GetTickCount64();
  18.   sc := TDict.Create;
  19.   for i := 0 to keyNum do
  20.   begin
  21.     key := Random(2147483647);
  22.     cnt := 0;
  23.     sc.TryGetValue(key, cnt);
  24.     sc[key] := cnt + 1;
  25.   end;
  26.   WriteLn('Populated (ticks) - ', GetTickCount64() - start);
  27.   WriteLn('Uniq keys - ', sc.Size, ', out of - ', keyNum);
  28.   Assign(outFile, 'out.txt');
  29.   Rewrite(outFile);
  30.   scit := sc.Min;
  31.   repeat
  32.     WriteLn(outFile, scit.Key, ' - ', scit.Value)
  33.   until not scit.Next;
  34.   Close(outFile);
  35.   scit.Destroy;
  36.   sc.Destroy;
  37.   WriteLn('Total (ticks) - ', GetTickCount64() - start);
  38.   ReadLn;
  39. end.
  40.  
Console output -
Code: Text  [Select]
  1. Populated (ticks) - 17297
  2. Uniq keys - 9976566, out of - 10000000
  3. Total (ticks) - 23906
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

julkas

  • Sr. Member
  • ****
  • Posts: 430
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #28 on: July 18, 2019, 11:06:58 am »
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user only for me. It is important that the result is correct.
Can you share input file?
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

SymbolicFrank

  • Hero Member
  • *****
  • Posts: 635
Re: Sorting and Counting
« Reply #29 on: July 18, 2019, 12:39:15 pm »
I did this for more than 25 million postal codes. It was really slow with the default containers and took a lot of memory. I made a custom class that used a smart way to insert new values, which reduced it to less than 15 minutes and 2 GB memory usage. But I stored more than one number in more than one container. Requesting a postal code was nearly instant.

I kept a small, loose index and remembered the last few insertions. After that, I just tried to insert a new entry halfway between the two nearest indexed items, and then repeatedly halfway until a hit was found. Which took less than 5 tries on average.