Lazarus

Programming => General => Topic started by: mpknap on July 16, 2019, 07:35:20 am

Title: Sorting and Counting
Post by: mpknap on July 16, 2019, 07:35:20 am
Hello.
I have a text file and in it unixtime times,
this is how it looks:

file.txt
1562551080

1562551140

1562551260

1562551260

1562551260

1562551260

1562551320

1562551380
.......


The file is very large. about 20MB.

Many lines are repeated, others are only once. The task that I want to do is to calculate how many times it is the same. So make an array with Unix time and the number of repetitions for a given time.

After sorting and calculation it should be like this:
1562551080 - 1

1562551140 - 1

1562551260 - 4

1562551320 -1

1562551380 -1
......


Any ideas?
Title: Re: Sorting and Counting
Post by: mangakissa on July 16, 2019, 08:17:48 am
You give the solution by yourself; a two dimensional array.
Title: Re: Sorting and Counting
Post by: julkas on July 16, 2019, 09:07:00 am
Possible approach - ordered associative array with uniq keys.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
Title: Re: Sorting and Counting
Post by: avk on July 16, 2019, 09:11:17 am
A data structure called "multiset" is well suited to your task.
Title: Re: Sorting and Counting
Post by: mangakissa on July 16, 2019, 04:20:30 pm
Possible approach - ordered associative array with uniq keys.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
FGL is standard with TFPGMap in it.

But this is easy to do with an 2 dimensional array. After finding the doubles, you can sort in every way yo want.
Title: Re: Sorting and Counting
Post by: jamie on July 16, 2019, 10:47:22 pm
Why do I think this is home work ?

In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.

Title: Re: Sorting and Counting
Post by: winni on July 16, 2019, 11:29:05 pm
Yes, as Jamie said:

* load it into a stringlist
* set the stringlist.sorted to true - and wait, 20 MB mlight take some time
* count the duplicates
* and now: write them with your format: ( time - count  ) into a second stringlist
* save it to disk - done

In case of Linux/Unix: this goes really faster with the shell and sort and uniq …….....

Winni
Title: Re: Sorting and Counting
Post by: mpknap on July 17, 2019, 07:00:05 am
Why do I think this is home work ?

In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.

In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobby. In life I am Bob the Builder :). I like to write something in my spare time, probably only to prove myself. I am interested in the CREDO project, I use their data. This time I want to juxtapose them with the detection of global earthquakes, I wonder if it has any connection, although nobody writes about it ;).

Here are some tools for the CREDO project https://github.com/credo-science/Windows-Tools

And the CREDO project itself here: https://credo.science/


Yes, I started, but I do not know how to count duplicates and how to do an array with the results. I have so much code for now. Good start?

Code: Pascal  [Select][+][-]
  1. procedure TForm1.Button1Click(Sender: TObject);
  2. const
  3.   BLOCK_SIZE = 10000;
  4.  
  5. var
  6.   f: textfile;
  7.   Times: TStringList;
  8.   det, separator: string;
  9.   i: integer;
  10.  
  11. begin
  12.   czas := TStringList.Create;
  13.   czas.Duplicates := dupIgnore;
  14.   czas.Sorted := True;
  15.  
  16.   AssignFile(f, 'Base_total.txt');
  17.   reset(f);
  18.  
  19.   SetLength(det, 0);
  20.   FillChar(det, SizeOf(det), 0);
  21.   while not EOF(f) do
  22.   begin
  23.  
  24.     if i mod BLOCK_SIZE = 0 then
  25.       SetLength(det, Length(det) + BLOCK_SIZE);
  26.  
  27.     readln(f, det);
  28.     czas.Add(det);
  29.     readln(f, separator);
  30.     Inc(i);
  31.   end;
  32.   closefile(f);
  33. end;    
Title: Re: Sorting and Counting
Post by: engkin on July 17, 2019, 08:47:31 am
Maybe this:
Code: Pascal  [Select][+][-]
  1. procedure CountDuplicates(AFileName, AResultFileName: String);
  2. var
  3.   List, Res: TStringList;
  4.   Item: String = '';
  5.   Count: integer = 0;
  6.   i: Integer = 0;
  7. begin
  8.   List := TStringList.Create;
  9.   Res := TStringList.Create;
  10.   try
  11.     List.LoadFromFile(AFileName);
  12.     List.Sorted := True;
  13.     while i<List.Count do
  14.     begin
  15.       // Get current item
  16.       Item := List[i];
  17.  
  18.       // Count similar items
  19.       Count := 0;
  20.       while (i<List.Count) and (List[i]=Item) do
  21.       begin
  22.         inc(i);
  23.         inc(count);
  24.       end;
  25.  
  26.       // Add the result
  27.       Res.Add(Format('%s - %d', [Item, Count]));
  28.     end;
  29.  
  30.     Res.SaveToFile(AResultFileName);
  31.   finally
  32.     List.Free;
  33.     Res.Free;
  34.   end;
  35. end;
  36.  

To use it:
Code: Pascal  [Select][+][-]
  1.   CountDuplicates('in.txt','out.txt');

The result would be saved in out.txt file.
Notice that the first result is going to be the number of empty lines, if any.
Title: Re: Sorting and Counting
Post by: mangakissa on July 17, 2019, 08:53:44 am
Code: Pascal  [Select][+][-]
  1. type
  2.  
  3.   TMyTime = packed record
  4.     Unixtime : string;
  5.     Counter  : word;
  6.   end;
  7.  
  8. implementation
  9.  
  10. procedure TForm1.Button1Click(Sender: TObject);
  11. var f      : textfile;
  12.     myline : string;
  13.     MyTime : array of TMyTime;
  14.     index  : integer;
  15. begin
  16.   assignfile(f,'file.txt');
  17.   reset(f);
  18.   index := 0;
  19.   while not eof(f) do
  20.   begin
  21.     readln(f,myline);
  22.     if not find(MyTime, myline) then
  23.     begin
  24.       index := index + 1;
  25.       setlength(MyTime,index);
  26.       MyTime[index - 1].Unixtime := myLine;
  27.       MyTime[index - 1].Counter := 1;
  28.     end;
  29.   end;
  30.   closefile(f);
  31.   //use bubblesort, quicksort, heapsort or other sort
  32.   for index := 0 to length(myTime) - 1 do
  33.     memo1.Lines.add(format('time : %s   duplicates : %3d',[myTime[index].Unixtime,myTime[index].Counter]));
  34. end;
  35.  
  36. function TForm1.Find(var aMyTime : array of TMytime; const aLine : string) : boolean;
  37. var index : integer;
  38. begin
  39.   result := false;
  40.   if length(aMyTime) > 0 then
  41.   begin
  42.     for index := low(aMyTime) to high(aMyTime) do
  43.     begin
  44.       if aMyTime[index].Unixtime = aLine then
  45.       begin
  46.         aMyTime[index].Counter := aMyTime[index].Counter + 1;
  47.         result := true;
  48.         break;
  49.       end;
  50.     end;
  51.   end;
  52. end;
  53.  
  54. end.
  55.  
stringlist is nice but for small files. and uses a lot of resources for nothing. Okay, things already built in like sort, but in this way you actually see what you're doing.
Title: Re: Sorting and Counting
Post by: avk on July 17, 2019, 01:32:40 pm
Well, I'm glad to see that I'm not the only one who believes that TStringList is not quite suitable for this task (if only because unixtime is a number, and sorting numbers is much faster than sorting strings). However, the task of counting duplicate values ​​is fairly common, and the standard library should have a ready solution for it.
Title: Re: Sorting and Counting
Post by: howardpc on July 17, 2019, 02:13:18 pm
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use less resources.
Code: Pascal  [Select][+][-]
  1. unit mainSortCount;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.    Classes, SysUtils, Forms, StdCtrls;
  9.  
  10. type
  11.    TForm1 = class(TForm)
  12.      Memo1: TMemo;
  13.      procedure FormCreate(Sender: TObject);
  14.    end;
  15.  
  16. var
  17.    Form1: TForm1;
  18.  
  19.   procedure SortCount(const anInFile: String; out aList: TStringList);
  20.  
  21.   procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  22.  
  23. implementation
  24.  
  25. {$R *.lfm}
  26.  
  27. { TForm1 }
  28.  
  29. procedure SortCount(const anInFile: String; out aList: TStringList);
  30. const
  31.   one = PtrUInt(1);
  32. var
  33.   textf: TextFile;
  34.   s: String;
  35.   idx: Integer;
  36.  
  37.   function GetSuccObj(anIntObj: TObject): TObject;
  38.   var
  39.     i: PtrUInt absolute anIntObj;
  40.   begin
  41.     Inc(i);
  42.     Exit(anIntObj);
  43.   end;
  44.  
  45. begin
  46.   Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  47.   aList := TStringList.Create;
  48.   aList.Duplicates := dupError;
  49.   aList.Sorted := True;
  50.   AssignFile(textf, anInFile);
  51.   try
  52.     Reset(textf);
  53.     while not EOF(textf) do
  54.       begin
  55.         ReadLn(textf, s);
  56.         s := Trim(s);
  57.         idx := aList.IndexOf(s);
  58.         case idx of
  59.           -1: aList.AddObject(s, TObject(one));
  60.           else
  61.             aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
  62.         end;
  63.       end;
  64.   finally
  65.     CloseFile(textf);
  66.   end;
  67. end;
  68.  
  69. procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  70. var
  71.   i: Integer;
  72. begin
  73.   if Assigned(aList) and Assigned(aMemo) then
  74.     begin
  75.       aMemo.Clear;
  76.       for i := 0 to aList.Count-1 do
  77.         aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
  78.     end;
  79. end;
  80.  
  81. procedure TForm1.FormCreate(Sender: TObject);
  82. var
  83.   sl: TStringList;
  84. begin
  85.   SortCount('infile.txt', sl);
  86.   try
  87.     ShowListInMemo(sl, Memo1);
  88.     Memo1.Lines.SaveToFile('outfile.txt');
  89.   finally
  90.     sl.Free;
  91.   end;
  92. end;
  93.  
  94. end.
Title: Re: Sorting and Counting
Post by: avk on July 17, 2019, 03:20:20 pm
Unfortunately, it’s not only a matter of resources: on a dataset of the size specified above, your solution is 4 times(90 s.) slower than engkin's  one(22 s.)  and it is prohibitively slow for a task of this size.
Title: Re: Sorting and Counting
Post by: julkas on July 17, 2019, 05:29:50 pm
GMap is fast enough. I will post my random test tomorrow.
Title: Re: Sorting and Counting
Post by: marcov on July 17, 2019, 05:56:51 pm
gmap might depend on the number of unique values. If that gets large it might also slow down.
Title: Re: Sorting and Counting
Post by: julkas on July 17, 2019, 06:07:38 pm
gmap might depend on the number of unique values. If that gets large it might also slow down.
About ~ 10000000 random longint values.
Title: Re: Sorting and Counting
Post by: avk on July 17, 2019, 06:40:45 pm
@julkas, to generate a test file, I used this:
Code: Pascal  [Select][+][-]
  1. procedure CreateTestFile(const aFileName: string);
  2. var
  3.   I: Integer;
  4. const
  5.   TestSize = 1750000;
  6. begin
  7.   with TStringList.Create do
  8.     try
  9.       for I := 1 to TestSize do
  10.         Add(IntToStr(1500000000 + Random(800000)));
  11.       SaveToFile(aFileName);
  12.     finally
  13.       Free;
  14.     end;
  15. end;
  16.  
The output is a ~20MB text file, containing ~700000 unique values.
Title: Re: Sorting and Counting
Post by: Akira1364 on July 17, 2019, 08:42:08 pm
Here's a example / benchmark of a version that uses Generics.Collections (available by default in trunk FPC, but also compatible with 3.0.4 if you just copy the sources to somewhere in whatever your unit search path is.)

Performance seems to be quite good (runs in around 0.8 - 1.1 seconds on average for me.)

Code: Pascal  [Select][+][-]
  1. program Example;
  2.  
  3. {$mode Delphi}
  4.  
  5. uses
  6.   SysUtils, DateUtils,
  7.   Generics.Defaults, Generics.Collections;
  8.  
  9. type
  10.   TIntPair = TPair<LongInt, LongInt>;
  11.   TIntMap = TDictionary<LongInt, LongInt>;
  12.  
  13.   function ComparePairs(constref L, R: TIntPair): LongInt;
  14.   begin
  15.     if L.Key < R.Key then
  16.       Result := -1
  17.     else if L.Key = R.Key then
  18.       Result := 0
  19.     else
  20.       Result := 1;
  21.   end;
  22.  
  23. var
  24.   I: LongInt;
  25.   Start: TDateTime;
  26.   InFile, OutFile: Text;
  27.   Map: TIntMap;
  28.   Pair: TIntPair;
  29.   Pairs: TArray<TIntPair>;
  30.  
  31. begin
  32.   // Generate a random test file first, with an adaptation of avk's method.
  33.   Randomize();
  34.   Assign(InFile, 'data.txt');
  35.   Rewrite(InFile);
  36.   for I := 1 to 1750000 do
  37.     WriteLn(InFile, 1500000000 + Random(800000));
  38.   Close(InFile);
  39.   I := 0;
  40.   // Start the timer only now, as this is where the real work starts.
  41.   Start := Now();
  42.   Map := TIntMap.Create();
  43.   // Allocate a big chunk of memory here in advance, for performance's sake.
  44.   // Doesn't matter if it's more than you end up needing, as Capacity is separate from Count.
  45.   Map.Capacity := 1750000;
  46.   Assign(InFile, 'data.txt');
  47.   Reset(InFile);
  48.   while not EOF(InFile) do begin
  49.     ReadLn(InFile, I);
  50.     if not Map.ContainsKey(I) then
  51.       Map.Add(I, 1)
  52.     else
  53.       Map[I] := Map[I] + 1;
  54.   end;
  55.   Close(InFile);
  56.   Pairs := Map.ToArray();
  57.   TArrayHelper<TIntPair>.Sort(
  58.     Pairs,
  59.     TComparer<TIntPair>.Construct(ComparePairs)
  60.   );
  61.   Assign(OutFile, 'output.txt');
  62.   Rewrite(OutFile);
  63.   for Pair in Pairs do with Pair do
  64.     WriteLn(OutFile, Key, ' - ', Value);
  65.   Close(OutFile);
  66.   Map.Free();
  67.   WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  68. end.
Title: Re: Sorting and Counting
Post by: howardpc on July 17, 2019, 08:47:16 pm
Unfortunately, it’s not only a matter of resources: on a dataset of the size specified above, your solution is 4 times(90 s.) slower than engkin's  one(22 s.)  and it is prohibitively slow for a task of this size.
I have changed my mind.

A generic container like TStringList is not well suited to this task on large files.
Here follows a simple dynamic array solution. The following routine executes in less than 2s on my average 5-year-old machine, using a test file generated from your code.
Code: Pascal  [Select][+][-]
  1. program SortCountlpi;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. uses
  6.   Classes, sysutils;
  7.  
  8. procedure CreateTestFile(const aFileName: string);
  9. var
  10.   I: Integer;
  11. const
  12.   TestSize = 1750000;
  13. begin
  14.   with TStringList.Create do
  15.     try
  16.       for I := 1 to TestSize do
  17.         Add(IntToStr(1500000000 + Random(800000)));
  18.       SaveToFile(aFileName);
  19.     finally
  20.       Free;
  21.     end;
  22. end;
  23.  
  24. procedure SortCountViaArray(const anInfile, anOutFile: String);
  25. var
  26.   arr: array of Integer;
  27.   textf: TextFile;
  28.   min: Integer = High(Integer);
  29.   max: Integer = -1;
  30.   i: Integer;
  31. begin
  32.   if FileExists(anInfile) then begin
  33.     AssignFile(textf, anInfile);
  34.     Reset(textf);
  35.     while not EOF(textf) do
  36.       begin
  37.         ReadLn(textf, i);
  38.         if i < min then
  39.           min := i;
  40.         if i > max then
  41.           max := i;
  42.       end;
  43.     SetLength(arr, max-min+1);
  44.  
  45.     Reset(textf);
  46.     while not EOF(textf) do
  47.       begin
  48.         ReadLn(textf, i);
  49.         Dec(i, min);
  50.         Inc(arr[i]);
  51.       end;
  52.     CloseFile(textf);
  53.  
  54.     AssignFile(textf, anOutFile);
  55.     Rewrite(textf);
  56.     for i := Low(arr) to High(arr) do
  57.       case (arr[i] > 0) of
  58.         True: WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  59.       end;
  60.     CloseFile(textf);
  61.     SetLength(arr, 0);
  62.   end;
  63. end;
  64.  
  65. var
  66.   t: TDateTime;
  67.  
  68. begin
  69.   CreateTestFile('infile.txt');
  70.   t := Time;
  71.   SortCountViaArray('infile.txt', 'outfile.txt');
  72.   Writeln('Elapsed time: ',DateTimeToTimeStamp(Time -t).Time,' ms');
  73. end.
Title: Re: Sorting and Counting
Post by: mpknap on July 17, 2019, 09:28:26 pm
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user only for me. It is important that the result is correct.

:)
Title: Re: Sorting and Counting
Post by: jamie on July 17, 2019, 11:13:25 pm
If you want faster you need to create a single chuck of memory that will house all of the file at once..

Load the file into that memory and write some code that shuffles the lines around. The idea is to not allow the memory manager to keep fragmentating the operation.

 Also putting it in a more simpler way, your fields are all the same size which makes this easy and since they are numbers only they can be converted into a Binary number and placed in a array to suites your needs..

  Once in the array you sort the array with simple algorithms.
 
 So step one is the read the file in line by line, convert each entry to a binary number and then store it in the
array..

 Once you have this data stored you then sort the array using something like a bubble sort.

 The array size can be calculated a head of time so you can dynamically create it..

 ArraySize := FileSize Div_NUmber_Chars_PerEntry_Plus2_For CRLF;

 So your array could be this
Array of Int64;
SetLength(MyArray, ArraySize);

etc.
the rest is just code...

Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 12:30:37 am
A generic container like TStringList is not well suited to this task on large files.

Well, an actually-generic hashmap that deals directly with the relevant type as opposed to unnecessarily stringifying the values is definitely well-suited to it. Your array solution is certainly quite fast though.

Here's both of ours wrapped up into a single command-line app with an option to tell it which to use:

Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5.  
  6. uses
  7.   SysUtils, DateUtils,
  8.   Generics.Defaults, Generics.Collections;
  9.  
  10. type
  11.   TIntPair = TPair<LongInt, LongInt>;
  12.  
  13.   function ComparePairs(constref L, R: TIntPair): LongInt;
  14.   begin
  15.     if L.Key < R.Key then
  16.       Result := -1
  17.     else if L.Key = R.Key then
  18.       Result := 0
  19.     else
  20.       Result := 1;
  21.   end;
  22.  
  23.   procedure SortCountAkira;
  24.   var
  25.     I: LongInt;
  26.     InOut: Text;
  27.     Map: TDictionary<LongInt, LongInt>;
  28.     Pair: TIntPair;
  29.     Pairs: TArray<TIntPair>;
  30.   begin
  31.     if FileExists(ParamStr(2)) then begin
  32.       Map := TDictionary<LongInt, LongInt>.Create();
  33.       Map.Capacity := 10000000;
  34.       Assign(InOut, ParamStr(2));
  35.       Reset(InOut);
  36.       while not EOF(InOut) do begin
  37.         ReadLn(InOut, I);
  38.         if not Map.ContainsKey(I) then
  39.           Map.Add(I, 1)
  40.         else
  41.           Map[I] := Map[I] + 1;
  42.       end;
  43.       Close(InOut);
  44.       Pairs := Map.ToArray();
  45.       Map.Free();
  46.       TArrayHelper<TIntPair>.Sort(
  47.         Pairs,
  48.         TComparer<TIntPair>.Construct(ComparePairs)
  49.       );
  50.       Assign(InOut, ParamStr(3));
  51.       Rewrite(InOut);
  52.       for Pair in Pairs do with Pair do
  53.         WriteLn(InOut, Key, ' - ', Value);
  54.       Close(InOut);
  55.     end;
  56.   end;
  57.  
  58.   procedure SortCountHoward;
  59.   var
  60.     arr: array of Integer;
  61.     textf: TextFile;
  62.     min: Integer = High(Integer);
  63.     max: Integer = -1;
  64.     i: Integer;
  65.   begin
  66.     if FileExists(ParamStr(2)) then begin
  67.       AssignFile(textf, ParamStr(2));
  68.       Reset(textf);
  69.       while not EOF(textf) do
  70.         begin
  71.           ReadLn(textf, i);
  72.           if i < min then
  73.             min := i;
  74.           if i > max then
  75.             max := i;
  76.         end;
  77.       SetLength(arr, max-min+1);
  78.  
  79.       Reset(textf);
  80.       while not EOF(textf) do
  81.         begin
  82.           ReadLn(textf, i);
  83.           Dec(i, min);
  84.           Inc(arr[i]);
  85.         end;
  86.       CloseFile(textf);
  87.  
  88.       AssignFile(textf, ParamStr(3));
  89.       Rewrite(textf);
  90.       for i := Low(arr) to High(arr) do
  91.         case (arr[i] > 0) of
  92.           True: WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  93.         end;
  94.       CloseFile(textf);
  95.       SetLength(arr, 0);
  96.     end;
  97.   end;
  98.  
  99. var
  100.   Start: TDateTime;
  101.  
  102. begin
  103.   if ParamCount() <> 3 then
  104.     WriteLn('Usage: occurrencecounter (-akira | -howard) infilename outfilename')
  105.   else if ParamStr(1) = '-akira' then
  106.   begin
  107.     Start := Now();
  108.     SortCountAkira();
  109.     WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  110.   end
  111.   else if ParamStr(1) = '-howard' then
  112.   begin
  113.     Start := Now();
  114.     SortCountHoward();
  115.     WriteLn(MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  116.   end
  117.   else
  118.     WriteLn('Usage: occurrencecounter (-akira | -howard) infilename outfilename');  
  119. end.

Also a generator program for the input file (which gives something a little more heavyweight than what people have been using so far):

Code: Pascal  [Select][+][-]
  1. program InputGenerator;
  2.  
  3. var
  4.   InFile: Text;
  5.   I: LongInt;
  6.  
  7. begin
  8.   Randomize();
  9.   Assign(InFile, 'infile.txt');
  10.   Rewrite(InFile);
  11.   for I := 1 to 10000000 do
  12.     WriteLn(InFile, 1500000000 + Random(800000));
  13.   Close(InFile);
  14. end.

The TXT file is large, sorting takes more than an hour.

Wait, really? That's thousands and thousands of times too long if it's only 20MB or so.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 03:41:54 am
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.
Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 04:13:47 am
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.

I was kind of going for testing the "worst case scenario", but it does seem a bit too worse case now that you point it out. I just edited it a bit in the comment.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 07:43:50 am
I slightly changed the Akira1364's program:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5.  
  6. uses
  7.   SysUtils, DateUtils,
  8.   Generics.Defaults, Generics.Collections,
  9.   LGUtils, , LGAbstractContainer, LGHashMultiSet, LGArrayHelpers;
  10.  
  11. type
  12.   TIntPair = TPair<LongInt, LongInt>;
  13.  
  14.   function ComparePairs(constref L, R: TIntPair): LongInt;
  15.   begin
  16.     if L.Key < R.Key then
  17.       Result := -1
  18.     else if L.Key = R.Key then
  19.       Result := 0
  20.     else
  21.       Result := 1;
  22.   end;
  23.  
  24. var
  25.   Total, Unique: Integer;
  26.   Start: TDateTime;
  27.  
  28.   procedure SortCountAkira;
  29.   var
  30.     I: LongInt;
  31.     InOut: Text;
  32.     Map: TDictionary<LongInt, LongInt>;
  33.     Pair: TIntPair;
  34.     Pairs: TArray<TIntPair>;
  35.   begin
  36.     Map := TDictionary<LongInt, LongInt>.Create();
  37.     Map.Capacity := 10000000;
  38.     Assign(InOut, ParamStr(1));
  39.     Reset(InOut);
  40.     while not EOF(InOut) do begin
  41.       ReadLn(InOut, I);
  42.       Inc(Total);
  43.       if not Map.ContainsKey(I) then
  44.         begin
  45.           Map.Add(I, 1);
  46.           Inc(Unique);
  47.         end
  48.       else
  49.         Map[I] := Map[I] + 1;
  50.     end;
  51.     Close(InOut);
  52.     Pairs := Map.ToArray();
  53.     Map.Free();
  54.     TArrayHelper<TIntPair>.Sort(
  55.       Pairs,
  56.       TComparer<TIntPair>.Construct(ComparePairs)
  57.     );
  58.     Assign(InOut, ParamStr(2));
  59.     Rewrite(InOut);
  60.     for Pair in Pairs do with Pair do
  61.       WriteLn(InOut, Key, ' - ', Value);
  62.     Close(InOut);
  63.   end;
  64.  
  65.   procedure SortCountHoward;
  66.   var
  67.     arr: array of Integer;
  68.     textf: TextFile;
  69.     min: Integer = High(Integer);
  70.     max: Integer = -1;
  71.     i: Integer;
  72.   begin
  73.     AssignFile(textf, ParamStr(1));
  74.     Reset(textf);
  75.     while not EOF(textf) do
  76.       begin
  77.         ReadLn(textf, i);
  78.         Inc(Total);
  79.         if i < min then
  80.           min := i;
  81.         if i > max then
  82.           max := i;
  83.       end;
  84.     SetLength(arr, max-min+1);
  85.  
  86.     Reset(textf);
  87.     while not EOF(textf) do
  88.       begin
  89.         ReadLn(textf, i);
  90.         Dec(i, min);
  91.         Inc(arr[i]);
  92.       end;
  93.     CloseFile(textf);
  94.  
  95.     AssignFile(textf, ParamStr(2));
  96.     Rewrite(textf);
  97.     for i := Low(arr) to High(arr) do
  98.       case (arr[i] > 0) of
  99.         True:
  100.           begin
  101.             WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  102.             Inc(Unique);
  103.           end;
  104.       end;
  105.     CloseFile(textf);
  106.     SetLength(arr, 0);
  107.   end;
  108.  
  109.   function EntryCmp(constref L, R: TGMultisetEntry<Integer>): SizeInt;
  110.   begin
  111.     if L.Key > R.Key then
  112.       Result := 1
  113.     else
  114.       if L.Key < R.Key then
  115.         Result := -1
  116.       else
  117.         Result := 0;
  118.   end;  
  119.  
  120.   procedure SortCountAvk1;
  121.   type
  122.     TCounter  = TGHashMultiSetLP<Integer>;
  123.     TCountRef = TGAutoRef<TCounter>;
  124.     TEntry    = TCounter.TEntry;
  125.   var
  126.     CountRef: TCountRef;
  127.     InOut: Text;
  128.     Counter: TCounter;
  129.     e: TEntry;
  130.     I: Integer;
  131.   begin
  132.     Counter := CountRef;
  133.     Assign(InOut, ParamStr(1));
  134.     Reset(InOut);
  135.     while not EOF(InOut) do
  136.       begin
  137.         ReadLn(InOut, I);
  138.         Counter.Add(I);
  139.       end;
  140.     Close(InOut);
  141.     Total := Counter.Count;
  142.     Unique := Counter.EntryCount;
  143.     if Counter.NonEmpty then
  144.       begin
  145.         Assign(InOut, ParamStr(2));
  146.         Rewrite(InOut);
  147.         for e in Counter.Entries.Sorted(@EntryCmp) do
  148.           with e do
  149.              WriteLn(InOut, Key, ' - ', Count);
  150.         Close(InOut);
  151.       end;
  152.   end;
  153.  
  154.   procedure SortCountAvk2;
  155.   var
  156.     List: array of Integer;
  157.     InOut: Text;
  158.     I, J, Count, DupCount: Integer;
  159.   begin
  160.     Assign(InOut, ParamStr(1));
  161.     Reset(InOut);
  162.     SetLength(List, 4096);
  163.     I := 0;
  164.     while not EOF(InOut) do
  165.       begin
  166.         ReadLn(InOut, J);
  167.         Inc(Total);
  168.         if Length(List) = I then
  169.           SetLength(List, I * 2);
  170.         List[I] := J;
  171.         Inc(I);
  172.       end;
  173.     Close(InOut);
  174.     SetLength(List, I);
  175.     if List = nil then
  176.       exit;
  177.     TGOrdinalArrayHelper<Integer>.Sort(List);
  178.     Count := I;
  179.     DupCount := 0;
  180.     I := 0;
  181.     Assign(InOut, ParamStr(2));
  182.     Rewrite(InOut);
  183.     repeat
  184.       J := List[I];
  185.       while (I < Count) and (List[I] = J) do
  186.         begin
  187.           Inc(DupCount);
  188.           Inc(I);
  189.         end;
  190.       WriteLn(InOut, J, ' - ', DupCount);
  191.       Inc(Unique);
  192.       DupCount := 0;
  193.     until I = Count;
  194.     Close(InOut);
  195.   end;
  196.  
  197.   procedure Run(aProc: TProcedure);
  198.   begin
  199.     Total := 0;
  200.     Unique := 0;
  201.     Start := Now;
  202.     try
  203.       aProc();
  204.       WriteLn('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  205.       WriteLn('total values: ', Total, ', unique values: ', Unique);
  206.     except
  207.       on e: Exception do
  208.         WriteLn('crashes with message "', e.Message, '"');
  209.     end;
  210.   end;
  211.  
  212. begin
  213.   if ParamCount <> 2 then
  214.     begin
  215.       WriteLn('Usage: occurrencecounter infilename outfilename');
  216.       exit;
  217.     end;
  218.   if not FileExists(ParamStr(1)) then
  219.     begin
  220.       WriteLn('Input file "', ParamStr(1), '" not found');
  221.       exit;
  222.     end;
  223.  
  224.   WriteLn('running SortCountAkira:');
  225.   Run(@SortCountAkira);
  226.   WriteLn;
  227.  
  228.   WriteLn('running SortCountHoward:');
  229.   Run(@SortCountHoward);
  230.   WriteLn;
  231.  
  232.  
  233.   WriteLn('running SortCountAvk1:');
  234.   Run(@SortCountAvk1);
  235.   WriteLn;
  236.  
  237.   WriteLn('running SortCountAvk2:');
  238.   Run(@SortCountAvk2);
  239. end.
  240.  
using current version of his InputGenerator output is(32-bit compiler):
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 4.7730
  3. total values: 10000000, unique values: 799999
  4.  
  5. running SortCountHoward:
  6. elapsed time: 6.0220
  7. total values: 10000000, unique values: 799999
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 3.7910
  11. total values: 10000000, unique values: 799999
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 2.9480
  15. total values: 10000000, unique values: 799999
  16.  
using previous version:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 14.5700
  3. total values: 10000000, unique values: 9976599
  4.  
  5. running SortCountHoward:
  6. crashes with message "Invalid pointer operation"
  7.  
  8. running SortCountAvk1:
  9. elapsed time: 8.1750
  10. total values: 10000000, unique values: 9976599
  11.  
  12. running SortCountAvk2:
  13. elapsed time: 6.6450
  14. total values: 10000000, unique values: 9976599
  15.  
Title: Re: Sorting and Counting
Post by: Thaddy on July 18, 2019, 08:37:51 am
@ AVK
You can optimize a little with {$I-}{$H-} or did you already do that? {$I-} speeds up writeln consideably.
Also I wouldn't use Randomize() but use a fixed seed, so the file becomes reproducable.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 09:25:34 am
@Thaddy, I think it does not make much sense. The idea was to show that for this task there are ways to solve without shamanic dances with a tambourine and in a reasonable time.
Title: Re: Sorting and Counting
Post by: julkas on July 18, 2019, 10:00:23 am
GMap random test. Out file size ~ 146 MB
Code: Pascal  [Select][+][-]
  1. program sc;
  2. {$mode delphi}
  3. uses gmap, gutil, SysUtils;
  4. const
  5.   keyNum = 10000000;
  6. type
  7.   TIntLess = TLess<LongInt>;
  8.   TDict = TMap<LongInt, LongInt, TIntLess>;
  9. var
  10.   sc: TDict;
  11.   scit: TDict.TIterator;
  12.   i: LongInt;
  13.   key, cnt: LongInt;
  14.   start: QWord;
  15.   outFile: Text;
  16. begin
  17.   start := GetTickCount64();
  18.   sc := TDict.Create;
  19.   for i := 0 to keyNum do
  20.   begin
  21.     key := Random(2147483647);
  22.     cnt := 0;
  23.     sc.TryGetValue(key, cnt);
  24.     sc[key] := cnt + 1;
  25.   end;
  26.   WriteLn('Populated (ticks) - ', GetTickCount64() - start);
  27.   WriteLn('Uniq keys - ', sc.Size, ', out of - ', keyNum);
  28.   Assign(outFile, 'out.txt');
  29.   Rewrite(outFile);
  30.   scit := sc.Min;
  31.   repeat
  32.     WriteLn(outFile, scit.Key, ' - ', scit.Value)
  33.   until not scit.Next;
  34.   Close(outFile);
  35.   scit.Destroy;
  36.   sc.Destroy;
  37.   WriteLn('Total (ticks) - ', GetTickCount64() - start);
  38.   ReadLn;
  39. end.
  40.  
Console output -
Code: Text  [Select][+][-]
  1. Populated (ticks) - 17297
  2. Uniq keys - 9976566, out of - 10000000
  3. Total (ticks) - 23906
Title: Re: Sorting and Counting
Post by: julkas on July 18, 2019, 11:06:58 am
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user only for me. It is important that the result is correct.
Can you share input file?
Title: Re: Sorting and Counting
Post by: SymbolicFrank on July 18, 2019, 12:39:15 pm
I did this for more than 25 million postal codes. It was really slow with the default containers and took a lot of memory. I made a custom class that used a smart way to insert new values, which reduced it to less than 15 minutes and 2 GB memory usage. But I stored more than one number in more than one container. Requesting a postal code was nearly instant.

I kept a small, loose index and remembered the last few insertions. After that, I just tried to insert a new entry halfway between the two nearest indexed items, and then repeatedly halfway until a hit was found. Which took less than 5 tries on average.
Title: Re: Sorting and Counting
Post by: Thaddy on July 18, 2019, 12:49:02 pm
It may be a lot quicker with a memory mapped file approach. Especially on very long files. Then the readln is from memory most of the time, not from disk.
Title: Re: Sorting and Counting
Post by: SymbolicFrank on July 18, 2019, 01:22:15 pm
I made my own SaveToFile and LoadFromFile, which created a text file that was 4.5 GB in size, IIRC. It took 2 minutes to load that into memory. But less than a millisecond to query.

Btw, this was on a fast Linux sever with lots of memory and SSDs.
Title: Re: Sorting and Counting
Post by: julkas on July 18, 2019, 01:48:32 pm
In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobby.
Check also this - https://en.wikipedia.org/wiki/External_sorting.
Title: Re: Sorting and Counting
Post by: mangakissa on July 18, 2019, 02:19:49 pm
My sample with quicksort and a 30 mb file size takes 69 seconds WITHOUT ANY TSTRINGLIST.
Just old program developping.
Title: Re: Sorting and Counting
Post by: SymbolicFrank on July 18, 2019, 02:37:03 pm
Storage in memory is also interesting. A linked list allows fast inserts, but is slow to access. An array has to be shifted. And indexes consist of arrays as well.

IIRC, I ended up with 100 separate arrays that all did a subset. That was fastest overall. And finding the right one was based on quicksort.
Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 02:49:45 pm
I slightly changed the Akira1364's program:

Yeah, LGenerics is really good. The MultiSet is definitely an even more appropriate data structure than a hashmap (which TDictionary is) for this.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 04:05:49 pm
Yeah, LGenerics is really good...
Nice to hear kindly words from you. :)
Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 04:19:41 pm
Nice to hear kindly words from you. :)

Oh, I didn't realize you were the author! Great job on it, again. It's by far the most advanced / feature-complete generics library ever written in Object Pascal.

On another note, I'm still extremely curious as to what "mpknap" is doing with his file currently that takes over an hour... is he using BogoSort or something? That's just crazily long.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 04:28:05 pm
Thank you very much.
@mangakissa, can you submit full version of your solution?
Title: Re: Sorting and Counting
Post by: mangakissa on July 18, 2019, 04:52:20 pm
the whole project  :D
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 06:33:03 pm
@julkas, @mangakissa
I changed your code a bit and pasted it into the Akira1364's app:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode Delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, gmap;
  12.  
  13. type
  14.   TIntPair = TPair<LongInt, LongInt>;
  15.  
  16.   function ComparePairs(constref L, R: TIntPair): LongInt;
  17.   begin
  18.     if L.Key < R.Key then
  19.       Result := -1
  20.     else if L.Key = R.Key then
  21.       Result := 0
  22.     else
  23.       Result := 1;
  24.   end;
  25.  
  26. var
  27.   Total, Unique: Integer;
  28.   Start: TDateTime;
  29.  
  30.   procedure SortCountAkira;
  31.   var
  32.     I: LongInt;
  33.     InOut: Text;
  34.     Map: TDictionary<LongInt, LongInt>;
  35.     Pair: TIntPair;
  36.     Pairs: TArray<TIntPair>;
  37.   begin
  38.     Map := TDictionary<LongInt, LongInt>.Create();
  39.     Map.Capacity := 10000000;
  40.     Assign(InOut, ParamStr(1));
  41.     Reset(InOut);
  42.     while not EOF(InOut) do begin
  43.       ReadLn(InOut, I);
  44.       Inc(Total);
  45.       if not Map.ContainsKey(I) then
  46.         begin
  47.           Map.Add(I, 1);
  48.           Inc(Unique);
  49.         end
  50.       else
  51.         Map[I] := Map[I] + 1;
  52.     end;
  53.     Close(InOut);
  54.     Pairs := Map.ToArray();
  55.     Map.Free();
  56.     TArrayHelper<TIntPair>.Sort(
  57.       Pairs,
  58.       TComparer<TIntPair>.Construct(ComparePairs)
  59.     );
  60.     Assign(InOut, ParamStr(2));
  61.     Rewrite(InOut);
  62.     for Pair in Pairs do with Pair do
  63.       WriteLn(InOut, Key, ' - ', Value);
  64.     Close(InOut);
  65.   end;
  66.  
  67.   procedure SortCountHoward;
  68.   var
  69.     arr: array of Integer;
  70.     textf: TextFile;
  71.     min: Integer = High(Integer);
  72.     max: Integer = -1;
  73.     i: Integer;
  74.   begin
  75.     AssignFile(textf, ParamStr(1));
  76.     Reset(textf);
  77.     while not EOF(textf) do
  78.       begin
  79.         ReadLn(textf, i);
  80.         Inc(Total);
  81.         if i < min then
  82.           min := i;
  83.         if i > max then
  84.           max := i;
  85.       end;
  86.     SetLength(arr, max-min+1);
  87.  
  88.     Reset(textf);
  89.     while not EOF(textf) do
  90.       begin
  91.         ReadLn(textf, i);
  92.         Dec(i, min);
  93.         Inc(arr[i]);
  94.       end;
  95.     CloseFile(textf);
  96.  
  97.     AssignFile(textf, ParamStr(2));
  98.     Rewrite(textf);
  99.     for i := Low(arr) to High(arr) do
  100.       case (arr[i] > 0) of
  101.         True:
  102.           begin
  103.             WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  104.             Inc(Unique);
  105.           end;
  106.       end;
  107.     CloseFile(textf);
  108.     SetLength(arr, 0);
  109.   end;
  110.  
  111.   procedure SortCountAvk1;
  112.   type
  113.     TCounter  = TGHashMultiSetLP<Integer>;
  114.     TCountRef = TGAutoRef<TCounter>;
  115.     TEntry    = TCounter.TEntry;
  116.  
  117.     function EntryCmp(constref L, R: TEntry): SizeInt;
  118.     begin
  119.       if L.Key > R.Key then
  120.         Result := 1
  121.       else
  122.         if L.Key < R.Key then
  123.           Result := -1
  124.         else
  125.           Result := 0;
  126.     end;
  127.  
  128.   var
  129.     CountRef: TCountRef;
  130.     InOut: Text;
  131.     Counter: TCounter;
  132.     e: TEntry;
  133.     I: Integer;
  134.   begin
  135.     Counter := CountRef;
  136.     Counter.LoadFactor := 0.7;
  137.     Assign(InOut, ParamStr(1));
  138.     Reset(InOut);
  139.     while not EOF(InOut) do
  140.       begin
  141.         ReadLn(InOut, I);
  142.         Counter.Add(I);
  143.       end;
  144.     Close(InOut);
  145.     Total := Counter.Count;
  146.     Unique := Counter.EntryCount;
  147.     if Counter.NonEmpty then
  148.       begin
  149.         Assign(InOut, ParamStr(2));
  150.         Rewrite(InOut);
  151.         for e in Counter.Entries.Sorted(EntryCmp) do
  152.           with e do
  153.             WriteLn(InOut, Key, ' - ', Count);
  154.         Close(InOut);
  155.       end;
  156.   end;
  157.  
  158.   procedure SortCountAvk2;
  159.   var
  160.     List: array of Integer;
  161.     InOut: Text;
  162.     I, J, Count, DupCount: Integer;
  163.   begin
  164.     Assign(InOut, ParamStr(1));
  165.     Reset(InOut);
  166.     SetLength(List, 4096);
  167.     I := 0;
  168.     while not EOF(InOut) do
  169.       begin
  170.         ReadLn(InOut, J);
  171.         Inc(Total);
  172.         if Length(List) = I then
  173.           SetLength(List, I * 2);
  174.         List[I] := J;
  175.         Inc(I);
  176.       end;
  177.     Close(InOut);
  178.     SetLength(List, I);
  179.     if List = nil then
  180.       exit;
  181.     TGOrdinalArrayHelper<Integer>.Sort(List);
  182.     Count := I;
  183.     DupCount := 0;
  184.     I := 0;
  185.     Assign(InOut, ParamStr(2));
  186.     Rewrite(InOut);
  187.     repeat
  188.       J := List[I];
  189.       while (I < Count) and (List[I] = J) do
  190.         begin
  191.           Inc(DupCount);
  192.           Inc(I);
  193.         end;
  194.       WriteLn(InOut, J, ' - ', DupCount);
  195.       Inc(Unique);
  196.       DupCount := 0;
  197.     until I = Count;
  198.     Close(InOut);
  199.   end;
  200.  
  201.   procedure SortCountJulkas;
  202.   type
  203.     TIntLess = TLess<LongInt>;
  204.     TDict = TMap<LongInt, LongInt, TIntLess>;
  205.   var
  206.     sc: TDict;
  207.     scit: TDict.TIterator;
  208.     InOut: Text;
  209.     key, cnt: LongInt;
  210.   begin
  211.     sc := TDict.Create;
  212.     Assign(InOut, ParamStr(1));
  213.     Reset(InOut);
  214.     while not EOF(InOut) do
  215.       begin
  216.         ReadLn(InOut, key);
  217.         Inc(Total);
  218.         cnt := 0;
  219.         sc.TryGetValue(key, cnt);
  220.         sc[key] := cnt + 1;
  221.       end;
  222.     Close(InOut);
  223.     Unique := sc.Size;
  224.     if Unique > 0 then
  225.       begin
  226.         Assign(InOut, ParamStr(2));
  227.         Rewrite(InOut);
  228.         scit := sc.Min;
  229.         repeat
  230.           WriteLn(InOut, scit.Key, ' - ', scit.Value);
  231.         until not scit.Next;
  232.         Close(InOut);
  233.         scit.Free;
  234.       end;
  235.     sc.Free;
  236.   end;
  237.  
  238.   procedure SortCountMangakissa;
  239.   type
  240.     TMyTime = packed record
  241.       Unixtime : Integer;
  242.       Counter  : word;
  243.     end;
  244.  
  245.     function Find(var aMyTime : array of TMytime; aLine : Integer) : boolean;
  246.     var index : integer;
  247.     begin
  248.       result := false;
  249.       if length(aMyTime) > 0 then
  250.       begin
  251.         for index := low(aMyTime) to high(aMyTime) do
  252.         begin
  253.           if aMyTime[index].Unixtime = aLine then
  254.           begin
  255.             aMyTime[index].Counter := aMyTime[index].Counter + 1;
  256.             result := true;
  257.             break;
  258.           end;
  259.         end;
  260.       end;
  261.     end;
  262.  
  263.     procedure QuickSort(var A: array of tMytime; iLo, iHi: Integer) ;
  264.      var
  265.        Lo, Hi, Pivot : Integer;
  266.        T             : TMyTime;
  267.      begin
  268.        Lo := iLo;
  269.        Hi := iHi;
  270.        Pivot := A[(Lo + Hi) div 2].Unixtime;
  271.        repeat
  272.          while A[Lo].Unixtime < Pivot do Inc(Lo) ;
  273.          while A[Hi].Unixtime > Pivot do Dec(Hi) ;
  274.          if Lo <= Hi then
  275.          begin
  276.            T := A[Lo];
  277.            A[Lo] := A[Hi];
  278.            A[Hi] := T;
  279.            Inc(Lo) ;
  280.            Dec(Hi) ;
  281.          end;
  282.        until Lo > Hi;
  283.        if Hi > iLo then QuickSort(A, iLo, Hi) ;
  284.        if Lo < iHi then QuickSort(A, Lo, iHi) ;
  285.      end;
  286.   var
  287.     InOut         : Text;
  288.     MyTime        : array of TMyTime = nil;
  289.     Item          : TMyTime;
  290.     myline, index : integer;
  291.   begin
  292.     Assign(InOut, ParamStr(1));
  293.     Reset(InOut);
  294.     index := 0;
  295.     while not EOF(InOut) do
  296.       begin
  297.         ReadLn(InOut, myline);
  298.         Inc(Total);
  299.         if not find(MyTime, myline) then
  300.         begin
  301.           index := index + 1;
  302.           setlength(MyTime,index);
  303.           MyTime[index - 1].Unixtime := myLine;
  304.           MyTime[index - 1].Counter := 1;
  305.         end;
  306.       end;
  307.     Close(InOut);
  308.     if MyTime = nil then
  309.       exit;
  310.     Unique := Length(MyTime);
  311.     QuickSort(Mytime, low(MyTime), high(mytime));
  312.     Assign(InOut, ParamStr(2));
  313.     Rewrite(InOut);
  314.     for Item in Mytime do
  315.       with Item do
  316.         WriteLn(InOut, Unixtime, ' - ', Counter);
  317.     Close(InOut);
  318.   end;
  319.  
  320.   procedure Run(aProc: TProcedure);
  321.   begin
  322.     Total := 0;
  323.     Unique := 0;
  324.     Start := Now;
  325.     try
  326.       aProc();
  327.       WriteLn('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  328.       WriteLn('#total: ', Total, ', #unique: ', Unique);
  329.     except
  330.       on e: Exception do
  331.         WriteLn('crashes with message "', e.Message, '"');
  332.     end;
  333.   end;
  334.  
  335. begin
  336.   if ParamCount <> 2 then
  337.     begin
  338.       WriteLn('Usage: OccurrenceCounter infilename outfilename');
  339.       exit;
  340.     end;
  341.   if not FileExists(ParamStr(1)) then
  342.     begin
  343.       WriteLn('Input file "', ParamStr(1), '" not found');
  344.       exit;
  345.     end;
  346.  
  347.   WriteLn('running SortCountAkira:');
  348.   Run(SortCountAkira);
  349.   WriteLn;
  350.  
  351.   WriteLn('running SortCountHoward:');
  352.   Run(@SortCountHoward);
  353.   WriteLn;
  354.  
  355.  
  356.   WriteLn('running SortCountAvk1:');
  357.   Run(SortCountAvk1);
  358.   WriteLn;
  359.  
  360.   WriteLn('running SortCountAvk2:');
  361.   Run(SortCountAvk2);
  362.   WriteLn;
  363.  
  364.   WriteLn('running SortCountJulkas:');
  365.   Run(SortCountJulkas);
  366.   WriteLn;
  367.  
  368.   WriteLn('running SortCountMangakissa:');
  369.   Run(SortCountMangakissa);
  370. end.
  371.  
Results for 20MB file:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 1.6700
  3. #total: 1750000, #unique: 710248
  4.  
  5. running SortCountHoward:
  6. elapsed time: 1.5500
  7. #total: 1750000, #unique: 710248
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 1.0100
  11. #total: 1750000, #unique: 710248
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 0.7200
  15. #total: 1750000, #unique: 710248
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 2.6600
  19. #total: 1750000, #unique: 710248
  20.  
  21. running SortCountMangakissa:
  22. elapsed time: 346.6510
  23. #total: 1750000, #unique: 710248
  24.  
  25.  
Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 07:44:00 pm
@Avk:

What do you get if you change the

Code: Pascal  [Select][+][-]
  1. Map.Capacity := 10000000;

in mine to something closer to the expected total for the 20MB file (as opposed to the 100+ MB one.)

Like:

Code: Pascal  [Select][+][-]
  1. Map.Capacity := 1750000;

or whatever. Or even if you just comment out that line completely, as for that much smaller expected input it may not be worth it at all over how TDictionary pre-allocates internally already.

Also FWIW, I'm quite sure that for Mangakissa's version, it's this part:

Code: Pascal  [Select][+][-]
  1. index := index + 1;
  2. setlength(MyTime,index);

that hurts the performance.
Title: Re: Sorting and Counting
Post by: julkas on July 18, 2019, 07:44:44 pm
@avk thanks for benchmark.
It's old but very good problem. You can learn something new - sorting algorithm, hashing, counting, ...
My approach is not fastest, but clean and short. I use only one class from FPC 3.0.4. May be tomorrow I will post faster variant.
Title: Re: Sorting and Counting
Post by: 440bx on July 18, 2019, 07:56:39 pm
@avk

did you use Akira's InputGenerator program to generate the input file ?

One reason I'm asking is because the code I copied from Akira's post generates a 120MB file.  Did you use a modified version ?  If so, could you also post whatever program you used to generate the input file ?

Thank you.


Title: Re: Sorting and Counting
Post by: Akira1364 on July 18, 2019, 08:12:57 pm
did you use Akira's InputGenerator program to generate the input file ?

I imagine he used the original "20MB" code of his from an earlier comment for that one.

If you change my generator program to look like this:

Code: Pascal  [Select][+][-]
  1. program InputGenerator;
  2.  
  3. var
  4.   InFile: Text;
  5.   I: LongInt;
  6.  
  7. begin
  8.   Randomize();
  9.   Assign(InFile, 'infile.txt');
  10.   Rewrite(InFile);
  11.   for I := 1 to 1750000 do
  12.     WriteLn(InFile, 1500000000 + Random(800000));
  13.   Close(InFile);
  14. end.

It will create the smaller 20MB file (with 1750000 total entries, versus the 10000000 total entries that the 100+ MB file has.)
Title: Re: Sorting and Counting
Post by: 440bx on July 18, 2019, 08:25:35 pm
I imagine he used the original "20MB" code of his from an earlier comment for that one.
I will use that. Thank you Akira.
Title: Re: Sorting and Counting
Post by: avk on July 18, 2019, 08:30:02 pm
What do you get if you change the ...

just commented:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 1.3890
  3. #total: 1750000, #unique: 710248
  4.  
  5. running SortCountHoward:
  6. elapsed time: 1.5600
  7. #total: 1750000, #unique: 710248
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 0.9980
  11. #total: 1750000, #unique: 710248
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 0.7490
  15. #total: 1750000, #unique: 710248
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 2.6210
  19. #total: 1750000, #unique: 710248
  20.  
  21. running SortCountMangakissa:
  22. elapsed time: 342.3730
  23. #total: 1750000, #unique: 710248
  24.  

Map.Capacity := 1750000:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 1.4040
  3. #total: 1750000, #unique: 710248
  4.  
  5. running SortCountHoward:
  6. elapsed time: 1.6130
  7. #total: 1750000, #unique: 710248
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 1.0440
  11. #total: 1750000, #unique: 710248
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 0.7430
  15. #total: 1750000, #unique: 710248
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 2.7800
  19. #total: 1750000, #unique: 710248
  20.  

Also FWIW, I'm quite sure that for Mangakissa's version, it's this part ...
It also seems that the array search has a quadratic complexity.

@440bx
These results was for 20MB file because of SortCountMangakissa.

For current version Akira's InputGenerator:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 5.0910
  3. #total: 10000000, #unique: 799999
  4.  
  5. running SortCountHoward:
  6. elapsed time: 6.3260
  7. #total: 10000000, #unique: 799999
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 4.1010
  11. #total: 10000000, #unique: 799999
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 3.0760
  15. #total: 10000000, #unique: 799999
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 14.3140
  19. #total: 10000000, #unique: 799999
  20.  
Title: Re: Sorting and Counting
Post by: mpknap on July 18, 2019, 08:41:06 pm
Gentlemen, I do not have time to deal with testing your algorithms. I will deal with Monday, on holidays;).
I am also glad that the topic is interesting.

Yesterday I wrote that the Engkin function has been working for an hour. She worked longer until the night was blue screen because the laptop was reset. But the fault was the TXT file. There was an empty line between the lines with the time Stamp. Today I made a TXT file without empty lines and after a few minutes the Out.txt file appeared.
I'm happy because that's what I meant :)

Soon I will check your other ideas. I was unsure that I will get such an encyclopedia from you;)
Title: Re: Sorting and Counting
Post by: 440bx on July 18, 2019, 08:54:57 pm
@440bx
For current version Akira's InputGenerator:
Excellent!.  Thank you Avk.
Title: Re: Sorting and Counting
Post by: howardpc on July 19, 2019, 12:14:51 am
To see what effect varying the size of the data file, and varying the number of unique values to be counted might have on the faster algorithms presented here I adapted Akira1364's program as follows.
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, gmap;
  12.  
  13. type
  14.   TIntPair = TPair<LongInt, LongInt>;
  15.   TProcedureArray = array of procedure;
  16.  
  17.   function ComparePairs(constref L, R: TIntPair): LongInt;
  18.   begin
  19.     if L.Key < R.Key then
  20.       Result := -1
  21.     else if L.Key = R.Key then
  22.       Result := 0
  23.     else
  24.       Result := 1;
  25.   end;
  26.  
  27. var
  28.   Total, Unique, repeatCount, randomrange: Integer;
  29.   Start: TDateTime;
  30.   inFilename: String = 'data.txt';
  31.   outFilename: String = 'sorted.txt';
  32.   routineName: String;
  33.   procedures: TProcedureArray;
  34.   proc: procedure;
  35.  
  36.   procedure GenerateData(randomRange: Integer=8; repeatMillionsCount: Integer=2);
  37.   var
  38.     InFile: Text;
  39.     I: LongInt;
  40.   begin
  41.     Assign(InFile, inFilename);
  42.     Rewrite(InFile);
  43.     for I := 1 to repeatMillionsCount * 1000000 do
  44.       WriteLn(InFile, 1500000000 + Random(randomRange * 100000));
  45.     Close(InFile);
  46.   end;
  47.  
  48.   procedure SortCountAkira;
  49.   var
  50.     I: LongInt;
  51.     InOut: Text;
  52.     Map: TDictionary<LongInt, LongInt>;
  53.     Pair: TIntPair;
  54.     Pairs: TArray<TIntPair>;
  55.   begin
  56.     routineName := {$I %currentroutine%};
  57.     Map := TDictionary<LongInt, LongInt>.Create();
  58.     Map.Capacity := 10000000;
  59.     Assign(InOut, inFilename);
  60.     Reset(InOut);
  61.     while not EOF(InOut) do begin
  62.       ReadLn(InOut, I);
  63.       Inc(Total);
  64.       if not Map.ContainsKey(I) then
  65.         begin
  66.           Map.Add(I, 1);
  67.           Inc(Unique);
  68.         end
  69.       else
  70.         Map[I] := Map[I] + 1;
  71.     end;
  72.     Close(InOut);
  73.     Pairs := Map.ToArray();
  74.     Map.Free();
  75.     TArrayHelper<TIntPair>.Sort(
  76.       Pairs,
  77.       TComparer<TIntPair>.Construct(ComparePairs)
  78.     );
  79.     Assign(InOut, outFilename);
  80.     Rewrite(InOut);
  81.     for Pair in Pairs do with Pair do
  82.       WriteLn(InOut, Key, ' - ', Value);
  83.     Close(InOut);
  84.   end;
  85.  
  86.   procedure SortCountHoward;
  87.   var
  88.     arr: array of Integer;
  89.     textf: TextFile;
  90.     min: Integer = High(Integer);
  91.     max: Integer = -1;
  92.     i: Integer;
  93.   begin
  94.     routineName := {$I %currentroutine%};
  95.     AssignFile(textf, inFilename);
  96.     Reset(textf);
  97.     while not EOF(textf) do
  98.       begin
  99.         ReadLn(textf, i);
  100.         Inc(Total);
  101.         if i < min then
  102.           min := i;
  103.         if i > max then
  104.           max := i;
  105.       end;
  106.     SetLength(arr, max-min+1);
  107.  
  108.     Reset(textf);
  109.     while not EOF(textf) do
  110.       begin
  111.         ReadLn(textf, i);
  112.         Dec(i, min);
  113.         Inc(arr[i]);
  114.       end;
  115.     CloseFile(textf);
  116.  
  117.     AssignFile(textf, outFilename);
  118.     Rewrite(textf);
  119.     for i := Low(arr) to High(arr) do
  120.       case (arr[i] > 0) of
  121.         True:
  122.           begin
  123.             WriteLn(textf, i+min, ' - ', arr[i]);
  124.             Inc(Unique);
  125.           end;
  126.       end;
  127.     CloseFile(textf);
  128.     SetLength(arr, 0);
  129.   end;
  130.  
  131.   procedure SortCountAvk1;
  132.   type
  133.     TCounter  = TGHashMultiSetLP<Integer>;
  134.     TCountRef = TGAutoRef<TCounter>;
  135.     TEntry    = TCounter.TEntry;
  136.  
  137.     function EntryCmp(constref L, R: TEntry): SizeInt;
  138.     begin
  139.       if L.Key > R.Key then
  140.         Result := 1
  141.       else
  142.         if L.Key < R.Key then
  143.           Result := -1
  144.         else
  145.           Result := 0;
  146.     end;
  147.  
  148.   var
  149.     CountRef: TCountRef;
  150.     InOut: Text;
  151.     Counter: TCounter;
  152.     e: TEntry;
  153.     I: Integer;
  154.   begin
  155.     routineName := {$I %currentroutine%};
  156.     Counter := CountRef;
  157.     Counter.LoadFactor := 0.7;
  158.     Assign(InOut, inFilename);
  159.     Reset(InOut);
  160.     while not EOF(InOut) do
  161.       begin
  162.         ReadLn(InOut, I);
  163.         Counter.Add(I);
  164.       end;
  165.     Close(InOut);
  166.     Total := Counter.Count;
  167.     Unique := Counter.EntryCount;
  168.     if Counter.NonEmpty then
  169.       begin
  170.         Assign(InOut, outFilename);
  171.         Rewrite(InOut);
  172.         for e in Counter.Entries.Sorted(EntryCmp) do
  173.           with e do
  174.             WriteLn(InOut, Key, ' - ', Count);
  175.         Close(InOut);
  176.       end;
  177.   end;
  178.  
  179.   procedure SortCountAvk2;
  180.   var
  181.     List: array of Integer;
  182.     InOut: Text;
  183.     I, J, Count, DupCount: Integer;
  184.   begin
  185.     routineName := {$I %currentroutine%};
  186.     Assign(InOut, inFilename);
  187.     Reset(InOut);
  188.     SetLength(List, 4096);
  189.     I := 0;
  190.     while not EOF(InOut) do
  191.       begin
  192.         ReadLn(InOut, J);
  193.         Inc(Total);
  194.         if Length(List) = I then
  195.           SetLength(List, I * 2);
  196.         List[I] := J;
  197.         Inc(I);
  198.       end;
  199.     Close(InOut);
  200.     SetLength(List, I);
  201.     if List = nil then
  202.       exit;
  203.     TGOrdinalArrayHelper<Integer>.Sort(List);
  204.     Count := I;
  205.     DupCount := 0;
  206.     I := 0;
  207.     Assign(InOut, outFilename);
  208.     Rewrite(InOut);
  209.     repeat
  210.       J := List[I];
  211.       while (I < Count) and (List[I] = J) do
  212.         begin
  213.           Inc(DupCount);
  214.           Inc(I);
  215.         end;
  216.       WriteLn(InOut, J, ' - ', DupCount);
  217.       Inc(Unique);
  218.       DupCount := 0;
  219.     until I = Count;
  220.     Close(InOut);
  221.   end;
  222.  
  223.   procedure SortCountJulkas;
  224.   type
  225.     TIntLess = TLess<LongInt>;
  226.     TDict = TMap<LongInt, LongInt, TIntLess>;
  227.   var
  228.     sc: TDict;
  229.     scit: TDict.TIterator;
  230.     InOut: Text;
  231.     key, cnt: LongInt;
  232.   begin
  233.     routineName := {$I %currentroutine%};
  234.     sc := TDict.Create;
  235.     Assign(InOut, inFilename);
  236.     Reset(InOut);
  237.     while not EOF(InOut) do
  238.       begin
  239.         ReadLn(InOut, key);
  240.         Inc(Total);
  241.         cnt := 0;
  242.         sc.TryGetValue(key, cnt);
  243.         sc[key] := cnt + 1;
  244.       end;
  245.     Close(InOut);
  246.     Unique := sc.Size;
  247.     if Unique > 0 then
  248.       begin
  249.         Assign(InOut, outFilename);
  250.         Rewrite(InOut);
  251.         scit := sc.Min;
  252.         repeat
  253.           WriteLn(InOut, scit.Key, ' - ', scit.Value);
  254.         until not scit.Next;
  255.         Close(InOut);
  256.         scit.Free;
  257.       end;
  258.     sc.Free;
  259.   end;
  260.  
  261.   procedure Run(aProc: TProcedure);
  262.   begin
  263.     Total := 0;
  264.     Unique := 0;
  265.     Start := Now;
  266.     try
  267.       aProc();
  268.       WriteLn(Copy(routineName, 10, 20):7,'''s time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4,' #unique: ',Unique,' #total: ',Total);
  269.     except
  270.       on e: Exception do
  271.         WriteLn('crashes with message "', e.Message, '"');
  272.     end;
  273.   end;
  274.  
  275. begin
  276.   Randomize;
  277.  
  278.   procedures := TProcedureArray.Create(SortCountJulkas, SortCountAkira, SortCountHoward, SortCountAvk1, SortCountAvk2);
  279.  
  280.   for randomrange := 1 to 10 do
  281.     begin
  282.       GenerateData(randomrange);
  283.       WriteLn(#10'RandomRange = ',randomrange);
  284.       for proc in procedures do
  285.         Run(proc);
  286.     end;
  287.  
  288.   for repeatCount := 1 to 10 do
  289.     begin
  290.       GenerateData(8, 2*repeatCount);
  291.       WriteLn(#10'repeatMillionsCount = ', 2*repeatCount);
  292.       for proc in procedures do
  293.         Run(proc);
  294.     end;
  295. end.
Some typical output is as follows:
Code: Pascal  [Select][+][-]
  1. repeatMillionsCount = 16
  2.  Julkas's time: 15.8210 #unique: 800000 #total: 16000000
  3.  Akira's time: 4.7230 #unique: 800000 #total: 16000000
  4.  Howard's time: 4.4930 #unique: 800000 #total: 16000000
  5.   Avk1's time: 3.9390 #unique: 800000 #total: 16000000
  6.    Avk2's time: 2.4480 #unique: 800000 #total: 16000000
  7.  
  8. repeatMillionsCount = 18
  9. Julkas's time: 17.4640 #unique: 800000 #total: 18000000
  10.   Akira's time: 5.2670 #unique: 800000 #total: 18000000
  11. Howard's time: 5.0560 #unique: 800000 #total: 18000000
  12.    Avk1's time: 4.4020 #unique: 800000 #total: 18000000
  13.   Avk2's time: 2.7890 #unique: 800000 #total: 18000000
  14.  
  15. repeatMillionsCount = 20
  16.  Julkas's time: 19.2770 #unique: 800000 #total: 20000000
  17.  Akira's time: 5.7910 #unique: 800000 #total: 20000000
  18.  Howard's time: 5.6010 #unique: 800000 #total: 20000000
  19.   Avk1's time: 4.8510 #unique: 800000 #total: 20000000
  20.    Avk2's time: 3.0690 #unique: 800000 #total: 20000000
For avk to design an algorithm that can analyse a 120 MB text file of 20,000,000 items, sort and count its data and and write out the result to a second file in 3 seconds on my ageing machine is pretty impressive.
Title: Re: Sorting and Counting
Post by: mangakissa on July 19, 2019, 08:36:47 am
Can this be done with {$mode objfpc}?
Title: Re: Sorting and Counting
Post by: Thaddy on July 19, 2019, 09:20:20 am
Can this be done with {$mode objfpc}?
Obviously, yes!, See Howards comments above.
Title: Re: Sorting and Counting
Post by: avk on July 19, 2019, 09:27:31 am
@howardpc, nice, very suitable.
Heh, highly likely my machine is even older than yours(comparing runtimes).
But as for my "algorithm", it seems that you are flattering me, it's just sorting an array and then counting duplicates.
Array helpers do about the same to extract unique(distinct) values.
Your counting sort does the same thing, only at the same time.
I'm more glad for the multiset, it seems it shows itself well.

@mangakissa, sure, just changed my version:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, gmap;
  12.  
  13. type
  14.   TIntPair = specialize TPair<LongInt, LongInt>;
  15.  
  16.   function ComparePairs(constref L, R: TIntPair): LongInt;
  17.   begin
  18.     if L.Key < R.Key then
  19.       Result := -1
  20.     else if L.Key = R.Key then
  21.       Result := 0
  22.     else
  23.       Result := 1;
  24.   end;
  25.  
  26. var
  27.   Total, Unique: Integer;
  28.   Start: TDateTime;
  29.  
  30.   procedure SortCountAkira;
  31.   var
  32.     I: LongInt;
  33.     InOut: Text;
  34.     Map: specialize TDictionary<LongInt, LongInt>;
  35.     Pair: TIntPair;
  36.     Pairs: specialize TArray<TIntPair>;
  37.   begin
  38.     Map := specialize TDictionary<LongInt, LongInt>.Create();
  39.     Map.Capacity := 10000000;
  40.     Assign(InOut, ParamStr(1));
  41.     Reset(InOut);
  42.     while not EOF(InOut) do begin
  43.       ReadLn(InOut, I);
  44.       Inc(Total);
  45.       if not Map.ContainsKey(I) then
  46.         begin
  47.           Map.Add(I, 1);
  48.           Inc(Unique);
  49.         end
  50.       else
  51.         Map[I] := Map[I] + 1;
  52.     end;
  53.     Close(InOut);
  54.     Pairs := Map.ToArray();
  55.     Map.Free();
  56.     specialize TArrayHelper<TIntPair>.Sort(
  57.       Pairs,
  58.       specialize TComparer<TIntPair>.Construct(@ComparePairs)
  59.     );
  60.     Assign(InOut, ParamStr(2));
  61.     Rewrite(InOut);
  62.     for Pair in Pairs do with Pair do
  63.       WriteLn(InOut, Key, ' - ', Value);
  64.     Close(InOut);
  65.   end;
  66.  
  67.   procedure SortCountHoward;
  68.   var
  69.     arr: array of Integer;
  70.     textf: TextFile;
  71.     min: Integer = High(Integer);
  72.     max: Integer = -1;
  73.     i: Integer;
  74.   begin
  75.     AssignFile(textf, ParamStr(1));
  76.     Reset(textf);
  77.     while not EOF(textf) do
  78.       begin
  79.         ReadLn(textf, i);
  80.         Inc(Total);
  81.         if i < min then
  82.           min := i;
  83.         if i > max then
  84.           max := i;
  85.       end;
  86.     SetLength(arr, max-min+1);
  87.  
  88.     Reset(textf);
  89.     while not EOF(textf) do
  90.       begin
  91.         ReadLn(textf, i);
  92.         Dec(i, min);
  93.         Inc(arr[i]);
  94.       end;
  95.     CloseFile(textf);
  96.  
  97.     AssignFile(textf, ParamStr(2));
  98.     Rewrite(textf);
  99.     for i := Low(arr) to High(arr) do
  100.       case (arr[i] > 0) of
  101.         True:
  102.           begin
  103.             WriteLn(textf, Format('%d - %d',[i+min, arr[i]]));
  104.             Inc(Unique);
  105.           end;
  106.       end;
  107.     CloseFile(textf);
  108.     SetLength(arr, 0);
  109.   end;
  110.  
  111.   procedure SortCountAvk1;
  112.   type
  113.     TCounter  = specialize TGHashMultiSetLP<Integer>;
  114.     TCountRef = specialize TGAutoRef<TCounter>;
  115.     TEntry    = TCounter.TEntry;
  116.  
  117.     function EntryCmp(constref L, R: TEntry): SizeInt;
  118.     begin
  119.       if L.Key > R.Key then
  120.         Result := 1
  121.       else
  122.         if L.Key < R.Key then
  123.           Result := -1
  124.         else
  125.           Result := 0;
  126.     end;
  127.  
  128.   var
  129.     CountRef: TCountRef;
  130.     InOut: Text;
  131.     Counter: TCounter;
  132.     e: TEntry;
  133.     I: Integer;
  134.   begin
  135.     Counter := CountRef;
  136.     //Counter.LoadFactor := 0.7;
  137.     Assign(InOut, ParamStr(1));
  138.     Reset(InOut);
  139.     while not EOF(InOut) do
  140.       begin
  141.         ReadLn(InOut, I);
  142.         Counter.Add(I);
  143.       end;
  144.     Close(InOut);
  145.     Total := Counter.Count;
  146.     Unique := Counter.EntryCount;
  147.     if Counter.NonEmpty then
  148.       begin
  149.         Assign(InOut, ParamStr(2));
  150.         Rewrite(InOut);
  151.         for e in Counter.Entries.Sorted(@EntryCmp) do
  152.           with e do
  153.             WriteLn(InOut, Key, ' - ', Count);
  154.         Close(InOut);
  155.       end;
  156.   end;
  157.  
  158.   procedure SortCountAvk2;
  159.   var
  160.     List: array of Integer = nil;
  161.     InOut: Text;
  162.     I, J, Count, DupCount: Integer;
  163.   begin
  164.     Assign(InOut, ParamStr(1));
  165.     Reset(InOut);
  166.     SetLength(List, 4096);
  167.     I := 0;
  168.     while not EOF(InOut) do
  169.       begin
  170.         ReadLn(InOut, J);
  171.         Inc(Total);
  172.         if Length(List) = I then
  173.           SetLength(List, I * 2);
  174.         List[I] := J;
  175.         Inc(I);
  176.       end;
  177.     Close(InOut);
  178.     SetLength(List, I);
  179.     if List = nil then
  180.       exit;
  181.     specialize TGOrdinalArrayHelper<Integer>.Sort(List);
  182.     Count := I;
  183.     DupCount := 0;
  184.     I := 0;
  185.     Assign(InOut, ParamStr(2));
  186.     Rewrite(InOut);
  187.     repeat
  188.       J := List[I];
  189.       while (I < Count) and (List[I] = J) do
  190.         begin
  191.           Inc(DupCount);
  192.           Inc(I);
  193.         end;
  194.       WriteLn(InOut, J, ' - ', DupCount);
  195.       Inc(Unique);
  196.       DupCount := 0;
  197.     until I = Count;
  198.     Close(InOut);
  199.   end;
  200.  
  201.   procedure SortCountJulkas;
  202.   type
  203.     TIntLess = specialize TLess<LongInt>;
  204.     TDict = specialize TMap<LongInt, LongInt, TIntLess>;
  205.   var
  206.     sc: TDict;
  207.     scit: TDict.TIterator;
  208.     InOut: Text;
  209.     key, cnt: LongInt;
  210.   begin
  211.     sc := TDict.Create;
  212.     Assign(InOut, ParamStr(1));
  213.     Reset(InOut);
  214.     while not EOF(InOut) do
  215.       begin
  216.         ReadLn(InOut, key);
  217.         Inc(Total);
  218.         cnt := 0;
  219.         sc.TryGetValue(key, cnt);
  220.         sc[key] := cnt + 1;
  221.       end;
  222.     Close(InOut);
  223.     Unique := sc.Size;
  224.     if Unique > 0 then
  225.       begin
  226.         Assign(InOut, ParamStr(2));
  227.         Rewrite(InOut);
  228.         scit := sc.Min;
  229.         repeat
  230.           WriteLn(InOut, scit.Key, ' - ', scit.Value);
  231.         until not scit.Next;
  232.         Close(InOut);
  233.         scit.Free;
  234.       end;
  235.     sc.Free;
  236.   end;
  237.  
  238.   procedure SortCountMangakissa;
  239.   type
  240.     TMyTime = packed record
  241.       Unixtime : Integer;
  242.       Counter  : word;
  243.     end;
  244.  
  245.     function Find(var aMyTime : array of TMytime; aLine : Integer) : boolean;
  246.     var index : integer;
  247.     begin
  248.       result := false;
  249.       if length(aMyTime) > 0 then
  250.       begin
  251.         for index := low(aMyTime) to high(aMyTime) do
  252.         begin
  253.           if aMyTime[index].Unixtime = aLine then
  254.           begin
  255.             aMyTime[index].Counter := aMyTime[index].Counter + 1;
  256.             result := true;
  257.             break;
  258.           end;
  259.         end;
  260.       end;
  261.     end;
  262.  
  263.     procedure QuickSort(var A: array of tMytime; iLo, iHi: Integer) ;
  264.      var
  265.        Lo, Hi, Pivot : Integer;
  266.        T             : TMyTime;
  267.      begin
  268.        Lo := iLo;
  269.        Hi := iHi;
  270.        Pivot := A[(Lo + Hi) div 2].Unixtime;
  271.        repeat
  272.          while A[Lo].Unixtime < Pivot do Inc(Lo) ;
  273.          while A[Hi].Unixtime > Pivot do Dec(Hi) ;
  274.          if Lo <= Hi then
  275.          begin
  276.            T := A[Lo];
  277.            A[Lo] := A[Hi];
  278.            A[Hi] := T;
  279.            Inc(Lo) ;
  280.            Dec(Hi) ;
  281.          end;
  282.        until Lo > Hi;
  283.        if Hi > iLo then QuickSort(A, iLo, Hi) ;
  284.        if Lo < iHi then QuickSort(A, Lo, iHi) ;
  285.      end;
  286.   var
  287.     InOut         : Text;
  288.     MyTime        : array of TMyTime = nil;
  289.     Item          : TMyTime;
  290.     myline, index : integer;
  291.   begin
  292.     Assign(InOut, ParamStr(1));
  293.     Reset(InOut);
  294.     index := 0;
  295.     while not EOF(InOut) do
  296.       begin
  297.         ReadLn(InOut, myline);
  298.         Inc(Total);
  299.         if not find(MyTime, myline) then
  300.         begin
  301.           index := index + 1;
  302.           setlength(MyTime,index);
  303.           MyTime[index - 1].Unixtime := myLine;
  304.           MyTime[index - 1].Counter := 1;
  305.         end;
  306.       end;
  307.     Close(InOut);
  308.     if MyTime = nil then
  309.       exit;
  310.     Unique := Length(MyTime);
  311.     QuickSort(Mytime, low(MyTime), high(mytime));
  312.     Assign(InOut, ParamStr(2));
  313.     Rewrite(InOut);
  314.     for Item in Mytime do
  315.       with Item do
  316.         WriteLn(InOut, Unixtime, ' - ', Counter);
  317.     Close(InOut);
  318.   end;
  319.  
  320.   procedure Run(aProc: TProcedure);
  321.   begin
  322.     Total := 0;
  323.     Unique := 0;
  324.     Start := Now;
  325.     try
  326.       aProc();
  327.       WriteLn('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  328.       WriteLn('#total: ', Total, ', #unique: ', Unique);
  329.     except
  330.       on e: Exception do
  331.         WriteLn('crashes with message "', e.Message, '"');
  332.     end;
  333.   end;
  334.  
  335. begin
  336.   if ParamCount <> 2 then
  337.     begin
  338.       WriteLn('Usage: OccurrenceCounter infilename outfilename');
  339.       exit;
  340.     end;
  341.   if not FileExists(ParamStr(1)) then
  342.     begin
  343.       WriteLn('Input file "', ParamStr(1), '" not found');
  344.       exit;
  345.     end;
  346.  
  347.   WriteLn('running SortCountAkira:');
  348.   Run(@SortCountAkira);
  349.   WriteLn;
  350.  
  351.   WriteLn('running SortCountHoward:');
  352.   Run(@SortCountHoward);
  353.   WriteLn;
  354.  
  355.  
  356.   WriteLn('running SortCountAvk1:');
  357.   Run(@SortCountAvk1);
  358.   WriteLn;
  359.  
  360.   WriteLn('running SortCountAvk2:');
  361.   Run(@SortCountAvk2);
  362.   WriteLn;
  363.  
  364.   WriteLn('running SortCountJulkas:');
  365.   Run(@SortCountJulkas);
  366.   WriteLn;
  367.  
  368.   WriteLn('running SortCountMangakissa:');
  369.   Run(@SortCountMangakissa);
  370. end.  
  371.  

output:
Code: Text  [Select][+][-]
  1. running SortCountAkira:
  2. elapsed time: 1.4130
  3. #total: 1750000, #unique: 710248
  4.  
  5. running SortCountHoward:
  6. elapsed time: 1.5600
  7. #total: 1750000, #unique: 710248
  8.  
  9. running SortCountAvk1:
  10. elapsed time: 1.0770
  11. #total: 1750000, #unique: 710248
  12.  
  13. running SortCountAvk2:
  14. elapsed time: 0.7170
  15. #total: 1750000, #unique: 710248
  16.  
  17. running SortCountJulkas:
  18. elapsed time: 2.6370
  19. #total: 1750000, #unique: 710248
  20.  
  21. running SortCountMangakissa:
  22. elapsed time: 375.5530
  23. #total: 1750000, #unique: 710248
  24.  
Title: Re: Sorting and Counting
Post by: 440bx on July 19, 2019, 09:31:39 am
Here are a couple of implementations using the Windows API.

The first version does not use Readln.

The second version does not use Readln nor Writeln except for 2 writeln(s) to the console.

NOTE: the input file name is hardcoded to be "infile.txt" for convenience (easy to change that if needed.)
Title: Re: Sorting and Counting
Post by: julkas on July 19, 2019, 10:28:35 am
Here is my faster (more compicated) version with TVector. Output file ~ 147 MB. Uniq count fixed.
Code: Pascal  [Select][+][-]
  1. program sc2;
  2. {$mode delphi}
  3. uses gvector, gutil, garrayutils, SysUtils;
  4. const
  5.   keyNum = 10000000;
  6. type
  7.   TIntLess = TLess<LongInt>;
  8.   TIntVect = TVector<LongInt>;
  9.   TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  10. var
  11.   sc: array[0..21474] of TIntVect;
  12.   i: LongInt;
  13.   pkey, key, cnt, uniq: LongInt;
  14.   offset: LongInt;
  15.   start: QWord;
  16.   outFile: Text;
  17. begin
  18.   start := GetTickCount64();
  19.   for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  20.  
  21.   for i := 0 to keyNum do
  22.   begin
  23.     key := Random(2147483647);
  24.     sc[key div 100000].PushBack(key mod 100000);
  25.   end;
  26.   WriteLn('Populated (ticks) - ', GetTickCount64() - start);
  27.  
  28.   Assign(outFile, 'out.txt');
  29.   Rewrite(outFile);
  30.  
  31.   for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  32.  
  33.   offset := -100000;
  34.   uniq := 0;
  35.   for i := Low(sc) to High(sc) do
  36.   begin
  37.     Inc(offset, 100000);
  38.     pkey := -1;
  39.     cnt := 0;
  40.     for key in sc[i] do
  41.     begin
  42.       if pkey <> key then
  43.       begin
  44.         if cnt <> 0 then
  45.         begin
  46.           WriteLn(outFile, offset + pkey, ' - ', cnt);
  47.           Inc(uniq);
  48.         end;
  49.         pkey := key;
  50.         cnt := 0;
  51.       end;
  52.       Inc(cnt);
  53.     end;
  54.     if cnt <> 0 then
  55.     begin
  56.       WriteLn(outFile, offset + pkey, ' - ', cnt);
  57.       Inc(uniq);
  58.     end;
  59.   end;
  60.  
  61.   Close(outFile);
  62.   for i := Low(sc) to High(sc) do
  63.     sc[i].Destroy;
  64.  
  65.   WriteLn('Total (ticks) - ', GetTickCount64() - start);
  66.   WriteLn('Uniq keys - ', uniq, ', out of - ', keyNum);
  67.   ReadLn;
  68. end.
Console output.
Code: Text  [Select][+][-]
  1. Populated (ticks) - 1312
  2. Total (ticks) - 4922
  3. Uniq keys - 9976566, out of - 10000000
Title: Re: Sorting and Counting
Post by: julkas on July 19, 2019, 02:43:43 pm
@Akira1364, @avk can you include my second algo in benchmark and run again? I have 3.0.4 and can't use LGenerics.
Thanks.
Title: Re: Sorting and Counting
Post by: avk on July 19, 2019, 08:10:44 pm
@julkas, done
@440bx:
So, do you offer to compete with the Windows file mapping and ntdll? Okay, why not. :)
To run your SortCount (from SortCount2) inside a Howard's benchmark, I put the SortCount2 code in a separate unit:
Code: Pascal  [Select][+][-]
  1. // LONGSTRINGS ON needed for ParamStr to return a null terminated string
  2.  
  3. {$LONGSTRINGS ON}
  4.  
  5. unit WinSortCount2;
  6.  
  7. interface
  8. uses
  9.   windows,
  10.   sysutils,
  11.   dateutils
  12.   ;
  13.  
  14. const
  15.   ntdll            = 'ntdll.dll';
  16.   kernel32         = 'kernel32.dll';
  17.  
  18. var
  19.   // constants to replace ParamStr(1) and ParamStr(2) respectively
  20.  
  21.   InFileName: string = '';
  22.   OutFileName: string = '';
  23.  
  24.   DataCount : ptruint;
  25.   Unique    : ptruint;
  26.  
  27.   procedure SortCount;
  28.  
  29. implementation
  30. // -----------------------------------------------------------------------------
  31. // input file related types
  32.  
  33. type
  34.   // the input file is made up of unixtimes occupying 10 bytes followed by
  35.   // a CR/LF (on Windows)  To avoid comparing strings (which is slow) we define
  36.   // the unixtime characters as being composed on a qword and a word followed
  37.   // the the CR/LF ending.
  38.  
  39.   TINPUT_FILE_ELEMENT = packed record
  40.     case boolean of
  41.     0 : (
  42.          ife_hi                : qword;       // first 8 bytes of unixtime
  43.          ife_lo                : word;        // trailing 2 bytes of unixtime
  44.  
  45.          ife_LineEnding        : word;        // CRLF
  46.         );
  47.     1 : (
  48.           ife_unixtime         : packed array[0..9] of char;
  49.           ife_crlf             : word;
  50.         );
  51.   end;
  52.   PINPUT_FILE_ELEMENT = ^TINPUT_FILE_ELEMENT;
  53.  
  54.  
  55.  
  56. // -----------------------------------------------------------------------------
  57. // kernel32 related types and functions
  58.  
  59. type
  60.   PSECURITY_ATTRIBUTES = ^TSECURITY_ATTRIBUTES;
  61.   TSECURITY_ATTRIBUTES = record
  62.     Length                      : DWORD;
  63.     PointerToSecurityDescriptor : pointer;
  64.     InheritHandle               : boolean32;
  65.   end;
  66.  
  67.  
  68. function GetFileSizeEx(    FileHandle        : THANDLE;
  69.                        var PointerToFileSize : qword)
  70.          : boolean32; stdcall; external kernel32;
  71.  
  72. function CreateFileA(PointerToFileName           : pchar;
  73.                      DesiredAccess               : DWORD;
  74.                      ShareMode                   : DWORD;
  75.                      PointerToSecurityAttributes : PSECURITY_ATTRIBUTES;
  76.                      CreationDisposition         : DWORD;
  77.                      FlagsAndAttributes          : DWORD;
  78.                      TemplateFileHandle          : THANDLE)
  79.          : THANDLE; stdcall; external kernel32;
  80.  
  81.  
  82. // -----------------------------------------------------------------------------
  83. // ntdll related types and functions
  84.  
  85. type
  86.   TCompareFunction = function (key : pointer; data : pointer) : ptrint; cdecl;
  87.  
  88. const
  89.   COMPARE_EQUAL    =  0;
  90.   COMPARE_GREATER  =  1;
  91.   COMPARE_LESS     = -1;
  92.  
  93.  
  94.  
  95. procedure qsort(Base            : pointer;
  96.                 ElementCount    : ptruint;
  97.                 ElementSize     : ptruint;
  98.                 CompareFunction : TCompareFunction);
  99.           cdecl; external ntdll;
  100.  
  101. procedure RtlMoveMemory(Destination : pointer;
  102.                         Source      : pointer;
  103.                         BytesToCopy : ptruint);
  104.           stdcall; external ntdll;
  105.  
  106.  
  107. const
  108.   CRLF                      = #13#10;
  109.   FORMAT            : pchar = '%10.10s  -  %d' + CRLF;
  110.  
  111.   // if there are no duplicates at all, then the output file will be the size
  112.   // of the input file plus 6 additional characters (see FORMAT above)
  113.  
  114.   ADDITIONAL_OUTPUT         = 6;
  115.  
  116. // NOTE : don't user user32.dll wsprintf, it's a paraplegic dog.
  117.  
  118. function sprintf(OutputDestination : pchar;
  119.                  Format            : pchar;
  120.                  UnixTime          : pchar;
  121.                  Count             : integer) : integer; cdecl;    { CDECL !!}
  122.   external ntdll name 'sprintf';
  123.  
  124. // -----------------------------------------------------------------------------
  125.  
  126. function CompareUnixTimes(EntryA, EntryB : PINPUT_FILE_ELEMENT)
  127.          : ptrint; cdecl;
  128. begin
  129.   if EntryA^.ife_hi > EntryB^.ife_hi then exit(COMPARE_GREATER);
  130.   if EntryA^.ife_hi < EntryB^.ife_hi then exit(COMPARE_LESS);
  131.  
  132.   // the first qword of both entries is the same, use the last 2 bytes of the
  133.   // unixtime
  134.  
  135.   if EntryA^.ife_lo > EntryB^.ife_lo then exit(COMPARE_GREATER);
  136.   if EntryA^.ife_lo < EntryB^.ife_lo then exit(COMPARE_LESS);
  137.  
  138.   // they are the same
  139.  
  140.   result := COMPARE_EQUAL;
  141. end;
  142.  
  143. // -----------------------------------------------------------------------------
  144.  
  145. procedure Error(Id : ptruint);
  146. begin
  147.   write('FATAL : ');
  148.   case Id of
  149.     1 : writeln('LoadInputFileIntoMemory failed');
  150.     2 : writeln('WriteOutputFile - failed to create output file');
  151.     3 : writeln('WriteOutputFile - failed to write to output file');
  152.   end;
  153.  
  154.   halt(Id);
  155. end;
  156.  
  157. // -----------------------------------------------------------------------------
  158.  
  159. function LoadInputFileIntoMemory (Filename     : pchar;
  160.                               var Filesize     : qword;
  161.                               var OutputBuffer : pchar)
  162.          : PINPUT_FILE_ELEMENT;
  163.   // maps the input file in memory, determines its size and copies the input
  164.   // file into a separate memory block (because the input file is not supposed
  165.   // to be sorted.)  Also allocates a buffer for the output file.
  166. var
  167.   FileHandle           : THANDLE =   0;
  168.   FileMapping          : THANDLE =   0;
  169.   FileMapAddress       : pointer = nil;
  170.  
  171.   // variables that will be returned upon success
  172.  
  173.   FileData             : pointer = nil;
  174.   Size                 : qword   =   0;
  175.   FileOutBuffer        : pointer = nil;
  176.  
  177.   // to create a local scope (strictly local)
  178.  
  179.   SCOPE                : integer =   0;
  180.  
  181.   UnixTimesCount       : qword   =   0;
  182.  
  183. const
  184.   // constants used by CreateFile
  185.  
  186.   NO_TEMPLATE_FILE     =       0;
  187.  
  188.   // constants used by CreateFileMapping
  189.  
  190.   NO_MAXIMUM_SIZE_HIGH =       0;      // 0 indicates to use the size of the
  191.   NO_MAXIMUM_SIZE_LOW  =       0;      //   file
  192.  
  193.   // constants used by MapViewOfFileEx
  194.  
  195.   FILE_OFFSET_HIGH     =       0;      //   file offset to map from
  196.   FILE_OFFSET_LOW      =       0;
  197.  
  198.   BEGINNING_TO_END     =       0;
  199.  
  200. begin
  201.   // initialize return values
  202.  
  203.   result       := nil;
  204.   Filesize     := 0;
  205.   OutputBuffer := nil;
  206.  
  207.   // map the input file and allocate necessary resources.
  208.  
  209.   for SCOPE := 1 to 1 do       // trick to create a scope one can break out of
  210.   begin
  211.     FileHandle := CreateFileA(Filename,
  212.                               GENERIC_READ,
  213.                               FILE_SHARE_READ,
  214.                               nil,
  215.                               OPEN_EXISTING,
  216.                               FILE_ATTRIBUTE_NORMAL,
  217.                               NO_TEMPLATE_FILE);
  218.  
  219.     if FileHandle = INVALID_HANDLE_VALUE   then break;
  220.  
  221.     if not GetFileSizeEx(FileHandle, Size) then break;
  222.     if Size = 0                            then break;       // empty file
  223.  
  224.     // with the file handle, create a mapping for it
  225.  
  226.     FileMapping := CreateFileMappingA(FileHandle,
  227.                                       nil,
  228.                                       PAGE_READONLY,
  229.                                       NO_MAXIMUM_SIZE_HIGH,  // use file size
  230.                                       NO_MAXIMUM_SIZE_LOW,
  231.                                       nil);
  232.  
  233.     if (FileMapping = 0)                   then break;
  234.  
  235.     FileMapAddress := MapViewOfFileEx(FileMapping,
  236.                                       FILE_MAP_READ,
  237.                                       FILE_OFFSET_HIGH,      // from beginning
  238.                                       FILE_OFFSET_LOW,
  239.                                       BEGINNING_TO_END,      // to end
  240.                                       nil);                  // map anywhere
  241.  
  242.     if FileMapAddress = nil                then break;
  243.  
  244.     // allocate a memory block to hold the file data since the file itself
  245.     // won't be sorted
  246.  
  247.     FileData := HeapAlloc(GetProcessHeap(),
  248.                           0,
  249.                           Size);
  250.  
  251.     if FileData = nil                      then break;
  252.  
  253.  
  254.     // copy the data in the file to the block we just allocated
  255.  
  256.     RtlMoveMemory(FileData, FileMapAddress, Size);
  257.  
  258.     // allocate a buffer for the output file
  259.  
  260.     UnixTimesCount := Size div sizeof(TINPUT_FILE_ELEMENT);
  261.     FileOutBuffer  := HeapAlloc(GetProcessHeap(),
  262.                                 0,
  263.                                 Size + (UnixTimesCount * ADDITIONAL_OUTPUT));
  264.  
  265.     if FileOutBuffer = nil then break;        // just in case additional
  266.                                               // instructions are added at a
  267.                                               // later time.
  268.  
  269.   end;
  270.  
  271.   if (FileHandle     <> INVALID_HANDLE_VALUE) then CloseHandle(FileHandle);
  272.   if (FileMapping    <>                    0) then CloseHandle(FileMapping);
  273.  
  274.   if (FileMapAddress <> nil) then UnmapViewOfFile(FileMapAddress);
  275.  
  276.   if (FileData <> nil) then
  277.   begin
  278.     Filesize     := Size;
  279.     OutputBuffer := FileOutBuffer;
  280.     result       := FileData;
  281.   end;
  282. end;
  283.  
  284. // -----------------------------------------------------------------------------
  285.  
  286. procedure WriteOutputFile(OutputFilename               : pchar;
  287.                           DataOutputBuffer, DataOutPtr : pchar);
  288. const
  289.   FILE_NO_SHARE    = 0;
  290.   NO_TEMPLATE_FILE = 0;
  291.  
  292. var
  293.   FileHandle : THANDLE;
  294.   ByteCount  : ptruint;
  295.  
  296.   Ok           : BOOL  = FALSE;
  297.   BytesWritten : DWORD = 0;
  298.  
  299. begin
  300.   FileHandle := CreateFileA(OutputFilename,
  301.                             GENERIC_READ or GENERIC_WRITE,
  302.                             FILE_NO_SHARE,
  303.                             nil,
  304.                             CREATE_ALWAYS,
  305.                             FILE_ATTRIBUTE_NORMAL,
  306.                             NO_TEMPLATE_FILE);
  307.  
  308.   if FileHandle = INVALID_HANDLE_VALUE then Error(2);
  309.  
  310.   ByteCount := DataOutPtr - DataOutputBuffer;
  311.  
  312.  
  313.   Ok := WriteFile(FileHandle,
  314.                   DataOutputBuffer^,
  315.                   ByteCount,
  316.                   BytesWritten,
  317.                   nil);
  318.  
  319.   if not Ok then Error(3);
  320.  
  321.   CloseHandle(FileHandle);
  322. end;
  323.  
  324. // -----------------------------------------------------------------------------
  325.  
  326. procedure SortCount;
  327. var
  328.   Data             : PINPUT_FILE_ELEMENT = nil;
  329.  
  330.   DataEnd          : PINPUT_FILE_ELEMENT = nil;
  331.   DataOutputBuffer : pchar               = nil;
  332.   DataOutPtr       : pchar               = nil;
  333.   DataOutLength    : integer             =   0;
  334.  
  335.   Filesize         : qword               =   0;  // compiler whines otherwise
  336.  
  337.   InstanceCount    : integer             =   0;
  338.  
  339.   i, j             : PINPUT_FILE_ELEMENT;
  340.  
  341. begin
  342.   Data := LoadInputFileIntoMemory(PChar(InFileName), Filesize, DataOutputBuffer);
  343.  
  344.   if Data = nil then Error(1);    // an empty file is treated as an error
  345.  
  346.   // sort the data
  347.  
  348.   DataCount := Filesize div sizeof(TINPUT_FILE_ELEMENT);
  349.  
  350.   qsort(Data,
  351.         DataCount,
  352.         sizeof(Data^),
  353.         TCompareFunction(@CompareUnixTimes)
  354.        );
  355.  
  356.   // use the same algorithm used by Avk to produce the output file.
  357.  
  358.   Unique        := 0;
  359.   InstanceCount := 0;
  360.  
  361.   i        := Data;
  362.   j        := i;
  363.   DataEnd  := Data + DataCount;
  364.  
  365.   // determine the duplicate counts
  366.  
  367.   DataOutPtr := DataOutputBuffer;
  368.  
  369.   repeat
  370.     while (j < DataEnd)
  371.       and
  372.           ((j^.ife_hi = i^.ife_hi) and (j^.ife_lo = i^.ife_lo))
  373.        do
  374.     begin
  375.       inc(InstanceCount);
  376.       inc(j);
  377.     end;
  378.  
  379.     DataOutLength := sprintf(DataOutPtr,
  380.                              FORMAT,
  381.                              i^.ife_unixtime, InstanceCount);
  382.  
  383.     inc(DataOutPtr, DataOutLength);
  384.  
  385.     InstanceCount := 0;
  386.     inc(Unique);
  387.  
  388.     i := j;
  389.   until j >= DataEnd;
  390.  
  391.   WriteOutputFile(PChar(OutFileName), DataOutputBuffer, DataOutPtr);
  392.  
  393.   //writeln('#unique : ', Unique, '   #Total : ', DataCount);
  394.  
  395.   HeapFree(GetProcessHeap(), 0, Data);
  396.   HeapFree(GetProcessHeap(), 0, DataOutputBuffer);
  397. end;
  398.  
  399. // -----------------------------------------------------------------------------
  400. //
  401. //var
  402. //  Start : TDATETIME;
  403. //
  404. //begin
  405. //  Start := Now;
  406. //
  407. //  SortCount;
  408. //
  409. //  writeln;
  410. //  writeln('elapsed time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4);
  411. end.
  412.  

output:
Code: Text  [Select][+][-]
  1.  
  2. RandomRange = 5
  3. Julkas1's time: 2.7450 #unique: 490842 #total: 2000000
  4. Julkas2's time: 0.9680 #unique: 0 #total: 2000000
  5.   Akira's time: 1.4500 #unique: 490842 #total: 2000000
  6.  Howard's time: 1.2950 #unique: 490842 #total: 2000000
  7.    Avk1's time: 0.9830 #unique: 490842 #total: 2000000
  8.    Avk2's time: 0.7330 #unique: 490842 #total: 2000000
  9.   440bx's time: 0.7490 #unique: 490842 #total: 2000000
  10.  
  11. RandomRange = 6
  12. Julkas1's time: 2.8080 #unique: 578734 #total: 2000000
  13. Julkas2's time: 0.9820 #unique: 0 #total: 2000000
  14.   Akira's time: 1.5140 #unique: 578734 #total: 2000000
  15.  Howard's time: 1.3260 #unique: 578734 #total: 2000000
  16.    Avk1's time: 1.0290 #unique: 578734 #total: 2000000
  17.    Avk2's time: 0.7800 #unique: 578734 #total: 2000000
  18.   440bx's time: 0.7180 #unique: 578734 #total: 2000000
  19.  
  20. RandomRange = 7
  21. Julkas1's time: 2.9330 #unique: 659876 #total: 2000000
  22. Julkas2's time: 0.9990 #unique: 0 #total: 2000000
  23.   Akira's time: 1.5750 #unique: 659876 #total: 2000000
  24.  Howard's time: 1.3420 #unique: 659876 #total: 2000000
  25.    Avk1's time: 1.0920 #unique: 659876 #total: 2000000
  26.    Avk2's time: 0.7950 #unique: 659876 #total: 2000000
  27.   440bx's time: 0.7340 #unique: 659876 #total: 2000000
  28.  
  29. RandomRange = 8
  30. Julkas1's time: 3.0110 #unique: 733819 #total: 2000000
  31. Julkas2's time: 1.0450 #unique: 0 #total: 2000000
  32.   Akira's time: 1.6540 #unique: 733819 #total: 2000000
  33.  Howard's time: 1.3880 #unique: 733819 #total: 2000000
  34.    Avk1's time: 1.1230 #unique: 733819 #total: 2000000
  35.    Avk2's time: 0.8270 #unique: 733819 #total: 2000000
  36.   440bx's time: 0.7490 #unique: 733819 #total: 2000000
  37.  
  38. RandomRange = 9
  39. Julkas1's time: 3.0880 #unique: 802323 #total: 2000000
  40. Julkas2's time: 1.0460 #unique: 0 #total: 2000000
  41.   Akira's time: 1.7310 #unique: 802323 #total: 2000000
  42.  Howard's time: 1.3890 #unique: 802323 #total: 2000000
  43.    Avk1's time: 1.2320 #unique: 802323 #total: 2000000
  44.    Avk2's time: 0.8420 #unique: 802323 #total: 2000000
  45.   440bx's time: 0.7650 #unique: 802323 #total: 2000000
  46.  
  47. RandomRange = 10
  48. Julkas1's time: 3.1520 #unique: 864809 #total: 2000000
  49. Julkas2's time: 1.0600 #unique: 0 #total: 2000000
  50.   Akira's time: 1.8260 #unique: 864809 #total: 2000000
  51.  Howard's time: 1.4190 #unique: 864809 #total: 2000000
  52.    Avk1's time: 1.2330 #unique: 864809 #total: 2000000
  53.    Avk2's time: 0.8730 #unique: 864809 #total: 2000000
  54.   440bx's time: 0.7650 #unique: 864809 #total: 2000000
  55.  
  56. repeatMillionsCount = 10
  57. Julkas1's time: 13.8380 #unique: 799999 #total: 10000000
  58. Julkas2's time: 4.2580 #unique: 0 #total: 10000000
  59.   Akira's time: 4.9460 #unique: 799999 #total: 10000000
  60.  Howard's time: 5.8180 #unique: 799999 #total: 10000000
  61.    Avk1's time: 4.0100 #unique: 799999 #total: 10000000
  62.    Avk2's time: 3.0730 #unique: 799999 #total: 10000000
  63.   440bx's time: 3.1670 #unique: 799999 #total: 10000000
  64.  
  65. repeatMillionsCount = 12
  66. Julkas1's time: 16.4420 #unique: 799999 #total: 12000000
  67. Julkas2's time: 5.0550 #unique: 0 #total: 12000000
  68.   Akira's time: 5.7870 #unique: 799999 #total: 12000000
  69.  Howard's time: 6.9110 #unique: 799999 #total: 12000000
  70.    Avk1's time: 4.7110 #unique: 799999 #total: 12000000
  71.    Avk2's time: 3.6350 #unique: 799999 #total: 12000000
  72.   440bx's time: 3.7600 #unique: 799999 #total: 12000000
  73.  
  74. repeatMillionsCount = 14
  75. Julkas1's time: 19.2500 #unique: 800000 #total: 14000000
  76. Julkas2's time: 5.8660 #unique: 0 #total: 14000000
  77.   Akira's time: 6.6300 #unique: 800000 #total: 14000000
  78.  Howard's time: 8.0650 #unique: 800000 #total: 14000000
  79.    Avk1's time: 5.4290 #unique: 800000 #total: 14000000
  80.    Avk2's time: 4.1970 #unique: 800000 #total: 14000000
  81.   440bx's time: 4.3360 #unique: 800000 #total: 14000000
  82.  
  83. repeatMillionsCount = 16
  84. Julkas1's time: 22.1520 #unique: 800000 #total: 16000000
  85. Julkas2's time: 6.6620 #unique: 0 #total: 16000000
  86.   Akira's time: 7.4250 #unique: 800000 #total: 16000000
  87.  Howard's time: 9.1400 #unique: 800000 #total: 16000000
  88.    Avk1's time: 6.1000 #unique: 800000 #total: 16000000
  89.    Avk2's time: 4.7580 #unique: 800000 #total: 16000000
  90.   440bx's time: 4.9300 #unique: 800000 #total: 16000000
  91.  
  92. repeatMillionsCount = 18
  93. Julkas1's time: 24.3990 #unique: 800000 #total: 18000000
  94. Julkas2's time: 7.4720 #unique: 0 #total: 18000000
  95.   Akira's time: 8.2060 #unique: 800000 #total: 18000000
  96.  Howard's time: 10.2340 #unique: 800000 #total: 18000000
  97.    Avk1's time: 6.8170 #unique: 800000 #total: 18000000
  98.    Avk2's time: 5.3040 #unique: 800000 #total: 18000000
  99.   440bx's time: 5.5220 #unique: 800000 #total: 18000000
  100.  
  101. repeatMillionsCount = 20
  102. Julkas1's time: 27.5810 #unique: 800000 #total: 20000000
  103. Julkas2's time: 8.3310 #unique: 0 #total: 20000000
  104.   Akira's time: 9.0160 #unique: 800000 #total: 20000000
  105.  Howard's time: 11.3890 #unique: 800000 #total: 20000000
  106.    Avk1's time: 7.5340 #unique: 800000 #total: 20000000
  107.    Avk2's time: 5.9130 #unique: 800000 #total: 20000000
  108.   440bx's time: 6.0840 #unique: 800000 #total: 20000000
  109.  
Title: Re: Sorting and Counting
Post by: julkas on July 19, 2019, 08:33:07 pm
@avk Thanks. Strange, why uniq is zero in my second algorithm.
I have tested it with @Akira1364 benchmark (without LGenerics).
Title: Re: Sorting and Counting
Post by: howardpc on July 19, 2019, 08:37:56 pm
I don't get a zero Unique with julkas' second implementation.
A typical comparison looks like this (I'm on Linux here, so have sadly omitted 440bx)
Code: Pascal  [Select][+][-]
  1.  Julkas1's time: 4.1120 #unique: 794594 #total: 4000000
  2.   Akira's time: 1.4990 #unique: 794594 #total: 4000000
  3.   Howard's time: 1.2950 #unique: 794594 #total: 4000000
  4.    Avk1's time: 1.2910 #unique: 794594 #total: 4000000
  5.  Julkas2's time: 1.0920 #unique: 794594 #total: 4000000
  6.    Avk2's time: 0.7510 #unique: 794594 #total: 4000000
Title: Re: Sorting and Counting
Post by: julkas on July 19, 2019, 09:10:48 pm
Hm, I can improve my second edition ;D
Title: Re: Sorting and Counting
Post by: 440bx on July 19, 2019, 09:11:34 pm
@julkas, done
@440bx:
So, do you offer to compete with the Windows file mapping and ntdll? Okay, why not. :)
To run your SortCount (from SortCount2) inside a Howard's benchmark, I put the SortCount2 code in a separate unit:
I appreciate that, thank you. :)

I have to say, I am impressed with the performance of TGOrdinalArrayHelper. 

Code: Pascal  [Select][+][-]
  1. RandomRange = 5
  2.   Avk2's time: 0.7330 #unique: 490842 #total: 2000000
  3.  440bx's time: 0.7490 #unique: 490842 #total: 2000000
  4.  
  5. RandomRange = 6
  6.   Avk2's time: 0.7800 #unique: 578734 #total: 2000000
  7.  440bx's time: 0.7180 #unique: 578734 #total: 2000000
  8.  
  9. RandomRange = 7
  10.   Avk2's time: 0.7950 #unique: 659876 #total: 2000000
  11.  440bx's time: 0.7340 #unique: 659876 #total: 2000000
  12.  
  13. RandomRange = 8
  14.   Avk2's time: 0.8270 #unique: 733819 #total: 2000000
  15.  440bx's time: 0.7490 #unique: 733819 #total: 2000000
  16.  
  17. RandomRange = 9
  18.   Avk2's time: 0.8420 #unique: 802323 #total: 2000000
  19.  440bx's time: 0.7650 #unique: 802323 #total: 2000000
  20.  
  21. RandomRange = 10
  22.   Avk2's time: 0.8730 #unique: 864809 #total: 2000000
  23.  440bx's time: 0.7650 #unique: 864809 #total: 2000000
  24.  
I expected results like the above because using the API one doesn't have to incur in string to binary conversion back and forth (and that is only in this particular case.)  On the other hand, I can't quite figure out why when the test is repeated, your implementation performs better.

Code: Pascal  [Select][+][-]
  1. repeatMillionsCount = 10
  2.   Avk2's time: 3.0730 #unique: 799999 #total: 10000000
  3.  440bx's time: 3.1670 #unique: 799999 #total: 10000000
  4.  
  5. repeatMillionsCount = 12
  6.   Avk2's time: 3.6350 #unique: 799999 #total: 12000000
  7.  440bx's time: 3.7600 #unique: 799999 #total: 12000000
  8.  
  9. repeatMillionsCount = 14
  10.   Avk2's time: 4.1970 #unique: 800000 #total: 14000000
  11.  440bx's time: 4.3360 #unique: 800000 #total: 14000000
  12.  
  13. repeatMillionsCount = 16
  14.   Avk2's time: 4.7580 #unique: 800000 #total: 16000000
  15.  440bx's time: 4.9300 #unique: 800000 #total: 16000000
  16.  
  17. repeatMillionsCount = 18
  18.   Avk2's time: 5.3040 #unique: 800000 #total: 18000000
  19.  440bx's time: 5.5220 #unique: 800000 #total: 18000000
  20.  
  21. repeatMillionsCount = 20
  22.   Avk2's time: 5.9130 #unique: 800000 #total: 20000000
  23.  440bx's time: 6.0840 #unique: 800000 #total: 20000000
  24.  
The differences are not large but, they are consistent.  When executed many times over, your implementation performs better.  One possible reason that comes to mind is, qsort has to call the compare function "many times" and, I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.)  I thought that not having to do string to integer conversions would compensate for the overhead of passing parameters to the sort function, apparently that is not enough.

Either way, TGOrdinalArrayHelper does a very nice job. 

Are those results for 32bit programs or 64bits programs ?  I'm thinking that in 64bit, the API implementation may be able to match TGOrdinalArrayHelper because it doesn't have to push the pointer arguments onto the stack. 


ETA:

A typical comparison looks like this (I'm on Linux here, so have sadly omitted 440bx)
One thing is for sure, I cannot compete on portability <chuckle>
Title: Re: Sorting and Counting
Post by: Akira1364 on July 19, 2019, 09:46:04 pm
I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.)

As a (happy!) user of LGenerics, I can tell you: it doesn't. As it's TGOrdinalArrayHelper, it uses a built-in sort specifically designed for, well, ordinals (or at least, anything that can be directly compared to them without need of an explicit comparison function, which could possibly also be achieved via operator overloading in the case of more complex types.)

For anything that doesn't fit that description it just won't compile if specialized with them (failing explicitly on the lines that attempt to do the comparisons, obviously.)

LGenerics does also though use a classic (well, classic as much as concepts mostly only relevant since generics were introduced to the language can be considered classic) trick for avoiding the fact that FPC cannot inline function pointers (and thus could not inline "comparer callbacks" if written with an API that used them directly.)

Instead, it takes structured types containing static "class functions" with specific "required names" for different tasks as generic parameters, which can be inlined (if marked inline) they're not being accessed through pointers.

Here's a simplistic example of what I'm talking about (note: the following code is not actually something from LGenerics)

Code: Pascal  [Select][+][-]
  1. program Example;
  2.  
  3. {$mode Delphi}
  4.  
  5. type
  6.   TCallbackFilterer<T> = record
  7.   public type
  8.     TFilterTest = function(constref Current: T): Boolean;
  9.   public
  10.     // ↓↓↓ Test is a function pointer and can never be inlined
  11.     class function Filter(constref Values: array of T; const Test: TFilterTest): TArray<T>; static;
  12.   end;
  13.  
  14.   class function TCallbackFilterer<T>.Filter(constref Values: array of T; const Test: TFilterTest): TArray<T>;
  15.   var I, J: PtrUInt;
  16.   begin
  17.     J := 0;
  18.     SetLength(Result, Length(Values));
  19.     for I := 0 to High(Values) do
  20.       if Test(Values[I]) then begin
  21.         Result[J] := Values[I];
  22.         Inc(J);
  23.       end;
  24.     SetLength(Result, J);
  25.   end;
  26.  
  27. type
  28.   TFunctorFilterer<T, TFilterImpl> = record
  29.   (*
  30.     TFilterImpl being something like:
  31.  
  32.     type TNotGenericTester = record
  33.     public
  34.       class function Test(constref Current: SomeType): Boolean; static; inline;
  35.     end;
  36.  
  37.     or like:
  38.  
  39.     type TGenericTester<T> = record
  40.     public
  41.       class function Test(constref Current: T): Boolean; static; inline;
  42.     end;
  43.   *)
  44.   public
  45.     class function Filter(constref Values: array of T): TArray<T>; static;
  46.   end;
  47.  
  48.   class function TFunctorFilterer<T, TFilterImpl>.Filter(constref Values: array of T): TArray<T>;
  49.   var I, J: PtrUInt;
  50.   begin
  51.     J := 0;
  52.     SetLength(Result, Length(Values));
  53.     for I := 0 to High(Values) do
  54.       // ↓↓↓ This won't compile if TFilterImpl, whatever it may be, has no "Test" function.
  55.       // ↓↓↓ Again though, if it does, this part (usually) will be inlined if it is marked as such.
  56.       if TFilterImpl.Test(Values[I]) then begin
  57.         Result[J] := Values[I];
  58.         Inc(J);
  59.       end;
  60.     SetLength(Result, J);
  61.   end;
  62.  
  63. begin
  64. end.
Title: Re: Sorting and Counting
Post by: 440bx on July 19, 2019, 10:03:17 pm
As a (happy!) user of LGenerics, I can tell you: it doesn't. As it's TGOrdinalArrayHelper, it uses a built-in sort specifically designed for, well, ordinals (or at least, anything that can be directly compared to them without need of an explicit comparison function, which could possibly also be achieved via operator overloading in the case of more complex types.)
Thank you for clearing that.  As you pointed out, the presence of Ordinal in the helper name is a good hint that the helper is designed and optimized for ordinal types.  I suspected it but, without looking at the actual code, I didn't want to draw any conclusions based on the just the name.

That would explain why, as the number of elements to sort grows, its advantage over anything that passes parameters on the stack grows as well. 

For anything that doesn't fit that description it just won't compile if specialized with them (failing explicitly on the lines that attempt to do the comparisons, obviously.)
That makes perfect sense.

Thank you.
Title: Re: Sorting and Counting
Post by: VTwin on July 20, 2019, 12:53:32 am
Just to see what might be done without using generics, I tried replacing the sort in Akv2 with a HeapSort and then a QuickSort (in howardpc's OccurenceCounter). The HeapSort version was about twice as slow as howardpc's, while the QuickSort version edged out that code. I did not directly compare it to Akv2, but that code smokes.

Very interesting!
Title: Re: Sorting and Counting
Post by: Akira1364 on July 20, 2019, 02:51:29 am
Just to see what might be done without using generics, I tried replacing the sort in Akv2 with a HeapSort and then a QuickSort (in howardpc's OccurenceCounter). The HeapSort version was about twice as slow as howardpc's, while the QuickSort version edged out that code. I did not directly compare it to Akv2, but that code smokes.
Very interesting!

TGOrdinalArrayHelper itself does also have a QuickSort, BTW. Also IntroSort and "DualPivotQuickSort."
Title: Re: Sorting and Counting
Post by: avk on July 20, 2019, 06:17:24 am
@julkas, sorry, it was my mistake,
correct code and output:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, garrayutils, gvector, gmap,
  12.   WinSortCount2;
  13.  
  14. type
  15.   TIntPair = TPair<LongInt, LongInt>;
  16.   TProcedureArray = array of procedure;
  17.  
  18.   function ComparePairs(constref L, R: TIntPair): LongInt;
  19.   begin
  20.     if L.Key < R.Key then
  21.       Result := -1
  22.     else if L.Key = R.Key then
  23.       Result := 0
  24.     else
  25.       Result := 1;
  26.   end;
  27.  
  28. var
  29.   Total, Unique, repeatCount, randomrange: Integer;
  30.   Start: TDateTime;
  31.   inFilename: String = 'data.txt';
  32.   outFilename: String = 'sorted.txt';
  33.   routineName: String;
  34.   procedures: TProcedureArray;
  35.   proc: procedure;
  36.  
  37.   procedure GenerateData(randomRange: Integer=8; repeatMillionsCount: Integer=2);
  38.   var
  39.     InFile: Text;
  40.     I: LongInt;
  41.   begin
  42.     Assign(InFile, inFilename);
  43.     Rewrite(InFile);
  44.     for I := 1 to repeatMillionsCount * 1000000 do
  45.       WriteLn(InFile, 1500000000 + Random(randomRange * 100000));
  46.     Close(InFile);
  47.   end;
  48.  
  49.   procedure SortCountAkira;
  50.   var
  51.     I: LongInt;
  52.     InOut: Text;
  53.     Map: TDictionary<LongInt, LongInt>;
  54.     Pair: TIntPair;
  55.     Pairs: TArray<TIntPair>;
  56.   begin
  57.     routineName := {$I %currentroutine%};
  58.     Map := TDictionary<LongInt, LongInt>.Create();
  59.     Map.Capacity := 10000000;
  60.     Assign(InOut, inFilename);
  61.     Reset(InOut);
  62.     while not EOF(InOut) do begin
  63.       ReadLn(InOut, I);
  64.       Inc(Total);
  65.       if not Map.ContainsKey(I) then
  66.         begin
  67.           Map.Add(I, 1);
  68.           Inc(Unique);
  69.         end
  70.       else
  71.         Map[I] := Map[I] + 1;
  72.     end;
  73.     Close(InOut);
  74.     Pairs := Map.ToArray();
  75.     Map.Free();
  76.     TArrayHelper<TIntPair>.Sort(
  77.       Pairs,
  78.       TComparer<TIntPair>.Construct(ComparePairs)
  79.     );
  80.     Assign(InOut, outFilename);
  81.     Rewrite(InOut);
  82.     for Pair in Pairs do with Pair do
  83.       WriteLn(InOut, Key, ' - ', Value);
  84.     Close(InOut);
  85.   end;
  86.  
  87.   procedure SortCountHoward;
  88.   var
  89.     arr: array of Integer;
  90.     textf: TextFile;
  91.     min: Integer = High(Integer);
  92.     max: Integer = -1;
  93.     i: Integer;
  94.   begin
  95.     routineName := {$I %currentroutine%};
  96.     AssignFile(textf, inFilename);
  97.     Reset(textf);
  98.     while not EOF(textf) do
  99.       begin
  100.         ReadLn(textf, i);
  101.         Inc(Total);
  102.         if i < min then
  103.           min := i;
  104.         if i > max then
  105.           max := i;
  106.       end;
  107.     SetLength(arr, max-min+1);
  108.  
  109.     Reset(textf);
  110.     while not EOF(textf) do
  111.       begin
  112.         ReadLn(textf, i);
  113.         Dec(i, min);
  114.         Inc(arr[i]);
  115.       end;
  116.     CloseFile(textf);
  117.  
  118.     AssignFile(textf, outFilename);
  119.     Rewrite(textf);
  120.     for i := Low(arr) to High(arr) do
  121.       case (arr[i] > 0) of
  122.         True:
  123.           begin
  124.             WriteLn(textf, i+min, ' - ', arr[i]);
  125.             Inc(Unique);
  126.           end;
  127.       end;
  128.     CloseFile(textf);
  129.     SetLength(arr, 0);
  130.   end;
  131.  
  132.   procedure SortCountAvk1;
  133.   type
  134.     TCounter  = TGHashMultiSetLP<Integer>;
  135.     TCountRef = TGAutoRef<TCounter>;
  136.     TEntry    = TCounter.TEntry;
  137.  
  138.     function EntryCmp(constref L, R: TEntry): SizeInt;
  139.     begin
  140.       if L.Key > R.Key then
  141.         Result := 1
  142.       else
  143.         if L.Key < R.Key then
  144.           Result := -1
  145.         else
  146.           Result := 0;
  147.     end;
  148.  
  149.   var
  150.     CountRef: TCountRef;
  151.     InOut: Text;
  152.     Counter: TCounter;
  153.     e: TEntry;
  154.     I: Integer;
  155.   begin
  156.     routineName := {$I %currentroutine%};
  157.     Counter := CountRef;
  158.     Counter.LoadFactor := 0.7;
  159.     Assign(InOut, inFilename);
  160.     Reset(InOut);
  161.     while not EOF(InOut) do
  162.       begin
  163.         ReadLn(InOut, I);
  164.         Counter.Add(I);
  165.       end;
  166.     Close(InOut);
  167.     Total := Counter.Count;
  168.     Unique := Counter.EntryCount;
  169.     if Counter.NonEmpty then
  170.       begin
  171.         Assign(InOut, outFilename);
  172.         Rewrite(InOut);
  173.         for e in Counter.Entries.Sorted(EntryCmp) do
  174.           with e do
  175.             WriteLn(InOut, Key, ' - ', Count);
  176.         Close(InOut);
  177.       end;
  178.   end;
  179.  
  180.   procedure SortCountAvk2;
  181.   var
  182.     List: array of Integer;
  183.     InOut: Text;
  184.     I, J, Count, DupCount: Integer;
  185.   begin
  186.     routineName := {$I %currentroutine%};
  187.     Assign(InOut, inFilename);
  188.     Reset(InOut);
  189.     SetLength(List, 4096);
  190.     I := 0;
  191.     while not EOF(InOut) do
  192.       begin
  193.         ReadLn(InOut, J);
  194.         Inc(Total);
  195.         if Length(List) = I then
  196.           SetLength(List, I * 2);
  197.         List[I] := J;
  198.         Inc(I);
  199.       end;
  200.     Close(InOut);
  201.     SetLength(List, I);
  202.     if List = nil then
  203.       exit;
  204.     TGOrdinalArrayHelper<Integer>.Sort(List);
  205.     Count := I;
  206.     DupCount := 0;
  207.     I := 0;
  208.     Assign(InOut, outFilename);
  209.     Rewrite(InOut);
  210.     repeat
  211.       J := List[I];
  212.       while (I < Count) and (List[I] = J) do
  213.         begin
  214.           Inc(DupCount);
  215.           Inc(I);
  216.         end;
  217.       WriteLn(InOut, J, ' - ', DupCount);
  218.       Inc(Unique);
  219.       DupCount := 0;
  220.     until I = Count;
  221.     Close(InOut);
  222.   end;
  223.  
  224.   procedure SortCountJulkas1;
  225.   type
  226.     TIntLess = TLess<LongInt>;
  227.     TDict = TMap<LongInt, LongInt, TIntLess>;
  228.   var
  229.     sc: TDict;
  230.     scit: TDict.TIterator;
  231.     InOut: Text;
  232.     key, cnt: LongInt;
  233.   begin
  234.     routineName := {$I %currentroutine%};
  235.     sc := TDict.Create;
  236.     Assign(InOut, inFilename);
  237.     Reset(InOut);
  238.     while not EOF(InOut) do
  239.       begin
  240.         ReadLn(InOut, key);
  241.         Inc(Total);
  242.         cnt := 0;
  243.         sc.TryGetValue(key, cnt);
  244.         sc[key] := cnt + 1;
  245.       end;
  246.     Close(InOut);
  247.     Unique := sc.Size;
  248.     if Unique > 0 then
  249.       begin
  250.         Assign(InOut, outFilename);
  251.         Rewrite(InOut);
  252.         scit := sc.Min;
  253.         repeat
  254.           WriteLn(InOut, scit.Key, ' - ', scit.Value);
  255.         until not scit.Next;
  256.         Close(InOut);
  257.         scit.Free;
  258.       end;
  259.     sc.Free;
  260.   end;
  261.  
  262.   procedure SortCountJulkas2;
  263.   type
  264.     TIntLess = TLess<LongInt>;
  265.     TIntVect = TVector<LongInt>;
  266.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  267.   var
  268.     sc: array[0..21474] of TIntVect;
  269.     i: LongInt;
  270.     pkey, key, cnt: LongInt;
  271.     offset: LongInt;
  272.     InOut: Text;
  273.   begin
  274.     routineName := {$I %currentroutine%};
  275.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  276.  
  277.     Assign(InOut, inFilename);
  278.     Reset(InOut);
  279.     while not EOF(InOut) do
  280.       begin
  281.         ReadLn(InOut, key);
  282.         Inc(Total);
  283.         sc[key div 100000].PushBack(key mod 100000);
  284.       end;
  285.     Close(InOut);
  286.  
  287.     Assign(InOut, outFilename);
  288.     Rewrite(InOut);
  289.  
  290.     for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  291.  
  292.     offset := -100000;
  293.     for i := Low(sc) to High(sc) do
  294.     begin
  295.       Inc(offset, 100000);
  296.       pkey := -1;
  297.       cnt := 0;
  298.       for key in sc[i] do
  299.       begin
  300.         if pkey <> key then
  301.         begin
  302.           if cnt <> 0 then
  303.           begin
  304.             WriteLn(InOut, offset + pkey, ' - ', cnt);
  305.             Inc(Unique);
  306.           end;
  307.           pkey := key;
  308.           cnt := 0;
  309.         end;
  310.         Inc(cnt);
  311.       end;
  312.       if cnt <> 0 then
  313.       begin
  314.         WriteLn(InOut, offset + pkey, ' - ', cnt);
  315.         Inc(Unique);
  316.       end;
  317.     end;
  318.  
  319.     Close(InOut);
  320.     for i := Low(sc) to High(sc) do
  321.       sc[i].Destroy;
  322.   end;
  323.  
  324.   procedure SortCount440bx;
  325.   begin
  326.     routineName := {$I %currentroutine%};
  327.     WinSortCount2.DataCount := 0;
  328.     WinSortCount2.Unique := 0;
  329.     WinSortCount2.InFileName := inFilename;
  330.     WinSortCount2.OutFileName := outFilename;
  331.     WinSortCount2.SortCount;
  332.     Total := WinSortCount2.DataCount;
  333.     Unique := WinSortCount2.Unique;
  334.   end;
  335.  
  336.   procedure Run(aProc: TProcedure);
  337.   begin
  338.     Total := 0;
  339.     Unique := 0;
  340.     Start := Now;
  341.     try
  342.       aProc();
  343.       WriteLn(Copy(routineName, 10, 20):7,'''s time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4,' #unique: ',Unique,' #total: ',Total);
  344.     except
  345.       on e: Exception do
  346.         WriteLn('crashes with message "', e.Message, '"');
  347.     end;
  348.   end;
  349.  
  350. begin
  351.   Randomize;
  352.  
  353.   procedures := TProcedureArray.Create(
  354.                    SortCountJulkas1, SortCountJulkas2, SortCountAkira, SortCountHoward, SortCountAvk1,
  355.                    SortCountAvk2, SortCount440bx);
  356.  
  357.   for randomrange := 5 to 10 do
  358.     begin
  359.       GenerateData(randomrange);
  360.       WriteLn(#10'RandomRange = ',randomrange);
  361.       for proc in procedures do
  362.         Run(proc);
  363.     end;
  364.  
  365.   for repeatCount := 5 to 10 do
  366.     begin
  367.       GenerateData(8, 2*repeatCount);
  368.       WriteLn(#10'repeatMillionsCount = ', 2*repeatCount);
  369.       for proc in procedures do
  370.         Run(proc);
  371.     end;
  372. end.
  373.  
Code: Text  [Select][+][-]
  1.  
  2. RandomRange = 5
  3. Julkas1's time: 2.7300 #unique: 490897 #total: 2000000
  4. Julkas2's time: 0.9200 #unique: 490897 #total: 2000000
  5.   Akira's time: 1.4530 #unique: 490897 #total: 2000000
  6.  Howard's time: 1.2790 #unique: 490897 #total: 2000000
  7.    Avk1's time: 0.9770 #unique: 490897 #total: 2000000
  8.    Avk2's time: 0.6910 #unique: 490897 #total: 2000000
  9.   440bx's time: 0.7130 #unique: 490897 #total: 2000000
  10.  
  11. RandomRange = 6
  12. Julkas1's time: 2.8220 #unique: 578415 #total: 2000000
  13. Julkas2's time: 0.9490 #unique: 578415 #total: 2000000
  14.   Akira's time: 1.4860 #unique: 578415 #total: 2000000
  15.  Howard's time: 1.2890 #unique: 578415 #total: 2000000
  16.    Avk1's time: 1.0180 #unique: 578415 #total: 2000000
  17.    Avk2's time: 0.7690 #unique: 578415 #total: 2000000
  18.   440bx's time: 0.7360 #unique: 578415 #total: 2000000
  19.  
  20. RandomRange = 7
  21. Julkas1's time: 3.0370 #unique: 659865 #total: 2000000
  22. Julkas2's time: 1.0280 #unique: 659865 #total: 2000000
  23.   Akira's time: 1.5800 #unique: 659865 #total: 2000000
  24.  Howard's time: 1.3480 #unique: 659865 #total: 2000000
  25.    Avk1's time: 1.0770 #unique: 659865 #total: 2000000
  26.    Avk2's time: 0.7860 #unique: 659865 #total: 2000000
  27.   440bx's time: 0.7690 #unique: 659865 #total: 2000000
  28.  
  29. RandomRange = 8
  30. Julkas1's time: 3.1540 #unique: 734083 #total: 2000000
  31. Julkas2's time: 1.0200 #unique: 734083 #total: 2000000
  32.   Akira's time: 1.6300 #unique: 734083 #total: 2000000
  33.  Howard's time: 1.3630 #unique: 734083 #total: 2000000
  34.    Avk1's time: 1.1290 #unique: 734083 #total: 2000000
  35.    Avk2's time: 0.8000 #unique: 734083 #total: 2000000
  36.   440bx's time: 0.7670 #unique: 734083 #total: 2000000
  37.  
  38. RandomRange = 9
  39. Julkas1's time: 3.0850 #unique: 802832 #total: 2000000
  40. Julkas2's time: 1.0320 #unique: 802832 #total: 2000000
  41.   Akira's time: 1.7190 #unique: 802832 #total: 2000000
  42.  Howard's time: 1.3930 #unique: 802832 #total: 2000000
  43.    Avk1's time: 1.1770 #unique: 802832 #total: 2000000
  44.    Avk2's time: 0.8290 #unique: 802832 #total: 2000000
  45.   440bx's time: 0.7690 #unique: 802832 #total: 2000000
  46.  
  47. RandomRange = 10
  48. Julkas1's time: 3.3060 #unique: 865097 #total: 2000000
  49. Julkas2's time: 1.0560 #unique: 865097 #total: 2000000
  50.   Akira's time: 1.7560 #unique: 865097 #total: 2000000
  51.  Howard's time: 1.4280 #unique: 865097 #total: 2000000
  52.    Avk1's time: 1.2060 #unique: 865097 #total: 2000000
  53.    Avk2's time: 0.8500 #unique: 865097 #total: 2000000
  54.   440bx's time: 0.7820 #unique: 865097 #total: 2000000
  55.  
  56. repeatMillionsCount = 10
  57. Julkas1's time: 13.9870 #unique: 799997 #total: 10000000
  58. Julkas2's time: 4.2000 #unique: 799997 #total: 10000000
  59.   Akira's time: 5.0900 #unique: 799997 #total: 10000000
  60.  Howard's time: 5.9450 #unique: 799997 #total: 10000000
  61.    Avk1's time: 4.0040 #unique: 799997 #total: 10000000
  62.    Avk2's time: 3.0150 #unique: 799997 #total: 10000000
  63.   440bx's time: 3.2640 #unique: 799997 #total: 10000000
  64.  
  65. repeatMillionsCount = 12
  66. Julkas1's time: 17.3100 #unique: 799999 #total: 12000000
  67. Julkas2's time: 5.4000 #unique: 799999 #total: 12000000
  68.   Akira's time: 6.4410 #unique: 799999 #total: 12000000
  69.  Howard's time: 7.3240 #unique: 799999 #total: 12000000
  70.    Avk1's time: 4.6600 #unique: 799999 #total: 12000000
  71.    Avk2's time: 3.5000 #unique: 799999 #total: 12000000
  72.   440bx's time: 3.8170 #unique: 799999 #total: 12000000
  73.  
  74. repeatMillionsCount = 14
  75. Julkas1's time: 19.2340 #unique: 800000 #total: 14000000
  76. Julkas2's time: 5.7660 #unique: 800000 #total: 14000000
  77.   Akira's time: 6.3470 #unique: 800000 #total: 14000000
  78.  Howard's time: 8.0130 #unique: 800000 #total: 14000000
  79.    Avk1's time: 5.3040 #unique: 800000 #total: 14000000
  80.    Avk2's time: 4.0090 #unique: 800000 #total: 14000000
  81.   440bx's time: 4.4350 #unique: 800000 #total: 14000000
  82.  
  83. repeatMillionsCount = 16
  84. Julkas1's time: 21.5130 #unique: 800000 #total: 16000000
  85. Julkas2's time: 6.5050 #unique: 800000 #total: 16000000
  86.   Akira's time: 7.1720 #unique: 800000 #total: 16000000
  87.  Howard's time: 9.0940 #unique: 800000 #total: 16000000
  88.    Avk1's time: 6.7390 #unique: 800000 #total: 16000000
  89.    Avk2's time: 4.7270 #unique: 800000 #total: 16000000
  90.   440bx's time: 5.0230 #unique: 800000 #total: 16000000
  91.  
  92. repeatMillionsCount = 18
  93. Julkas1's time: 27.4910 #unique: 800000 #total: 18000000
  94. Julkas2's time: 7.9570 #unique: 800000 #total: 18000000
  95.   Akira's time: 7.9090 #unique: 800000 #total: 18000000
  96.  Howard's time: 10.1090 #unique: 800000 #total: 18000000
  97.    Avk1's time: 6.7080 #unique: 800000 #total: 18000000
  98.    Avk2's time: 5.1170 #unique: 800000 #total: 18000000
  99.   440bx's time: 5.5690 #unique: 800000 #total: 18000000
  100.  
  101. repeatMillionsCount = 20
  102. Julkas1's time: 26.8320 #unique: 800000 #total: 20000000
  103. Julkas2's time: 8.0490 #unique: 800000 #total: 20000000
  104.   Akira's time: 8.6270 #unique: 800000 #total: 20000000
  105.  Howard's time: 11.1390 #unique: 800000 #total: 20000000
  106.    Avk1's time: 7.3470 #unique: 800000 #total: 20000000
  107.    Avk2's time: 5.6160 #unique: 800000 #total: 20000000
  108.   440bx's time: 6.2250 #unique: 800000 #total: 20000000
  109.  

@Akira1364, thank you for the detailed explanation LGenerics's internals.
Title: Re: Sorting and Counting
Post by: lucamar on July 20, 2019, 06:23:13 am
Fascinating thread.

We should send a copy to Knuth, to show that the flame is being kept alive :D
Title: Re: Sorting and Counting
Post by: 440bx on July 20, 2019, 07:36:17 am
Fascinating thread.
yes, it is.

One thing I noticed is that the number of unique occurrences basically stays constant from 10 million to 20 million.  That's likely caused by the range not growing along with the number of unixtimes in the file.

It seems reasonable to presume that a larger number of elements are usually obtained over a longer period of time which would mean that the range (controlled by randomRange) should be larger as a result of that.

Increasing the range as the number of unixtimes grows would likely reflect what happens in reality more accurately.


ETA:

The attached version - SortCount3 (unit) - should perform better than SortCount2 (if the range remains constant instead of increasing.)



Title: Re: Sorting and Counting
Post by: ASerge on July 20, 2019, 10:09:18 am
Looks great, but not testable. Avk, can you upload the full project compilable on fpc 3.0.4? The function from 440bx needs to be wrapped in {$IFDEF WINDOWS}.
Title: Re: Sorting and Counting
Post by: julkas on July 20, 2019, 11:54:24 am
Second edition improvement. CPU - Intel Dual E2200, RAM - 3 GB, SATA HDD. Please update your benchmarks (check Lazarus project mode - debug or release, my - release)  ;D
Code: Pascal  [Select][+][-]
  1. program sc2;
  2. {$mode delphi}
  3. uses gvector, gutil, garrayutils, SysUtils;
  4. const
  5.   keyNum = 10000000;
  6.   bsz = 1 shl 17;
  7. type
  8.   TIntLess = TLess<LongInt>;
  9.   TIntVect = TVector<LongInt>;
  10.   TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  11. var
  12.   sc: array[0..2147483647 shr 17] of TIntVect;
  13.   i: LongInt;
  14.   pkey, key, cnt, uniq: LongInt;
  15.   offset: LongInt;
  16.   start: QWord;
  17.   outFile: Text;
  18. begin
  19.   start := GetTickCount64();
  20.   for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  21.  
  22.   for i := 0 to keyNum do
  23.   begin
  24.     key := Random(2147483647);
  25.     //key := 1500000000 + Random(800000);
  26.     sc[key shr 17].PushBack(key and $1FFFF);
  27.   end;
  28.   WriteLn('Populated (ticks) - ', GetTickCount64() - start);
  29.   Assign(outFile, 'out.txt');
  30.   Rewrite(outFile);
  31.  
  32.   offset := -bsz;
  33.   uniq := 0;
  34.   for i := Low(sc) to High(sc) do
  35.   begin
  36.     Inc(offset, bsz);
  37.     pkey := -1;
  38.     cnt := 0;
  39.     if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  40.     for key in sc[i] do
  41.     begin
  42.       if pkey <> key then
  43.       begin
  44.         if cnt <> 0 then
  45.         begin
  46.           WriteLn(outFile, offset + pkey, ' - ', cnt);
  47.           Inc(uniq);
  48.         end;
  49.         pkey := key;
  50.         cnt := 0;
  51.       end;
  52.       Inc(cnt);
  53.     end;
  54.     if cnt <> 0 then
  55.     begin
  56.       WriteLn(outFile, offset + pkey, ' - ', cnt);
  57.       Inc(uniq);
  58.     end;
  59.   end;
  60.  
  61.   Close(outFile);
  62.   for i := Low(sc) to High(sc) do sc[i].Destroy;
  63.   WriteLn('Total (ticks) - ', GetTickCount64() - start);
  64.   WriteLn('Uniq keys - ', uniq, ', out of - ', keyNum);
  65.   ReadLn;
  66. end.
Output
Code: Text  [Select][+][-]
  1. Populated (ticks) - 3853
  2. Total (ticks) - 15397
  3. Uniq keys - 9976566, out of - 10000000
Title: Re: Sorting and Counting
Post by: Thaddy on July 20, 2019, 03:05:54 pm
Glad to see 440bx used my suggestion to use memory mapped files. This can also be done for unix-likes with fpmmap/fpmflush/fpmunmap. It should render a similar speed increase (as I explained)
Title: Re: Sorting and Counting
Post by: VTwin on July 20, 2019, 06:30:18 pm
TGOrdinalArrayHelper itself does also have a QuickSort, BTW. Also IntroSort and "DualPivotQuickSort."

Impressive. I see the default is an Introsort. I tried my own home-rolled Quicksort, Heapsort, and DualPivotQuicksort, to see how they held up. The Quicksort does well, oddly the DualPivotQuicksort does worse.

It goes to show the value of a well crafted generics library, no tambourines required. I have some catch up reading to do on generics. Kudos to avk.

Code: Pascal  [Select][+][-]
  1. repeatMillionsCount = 20
  2.     Howard time: 7.1940 #unique: 800000 #total: 20000000
  3.     VTHeap time: 17.262 #unique: 800000 #total: 20000000
  4.    VTQuick time: 6.5210 #unique: 800000 #total: 20000000
  5.  VTDPQuick time: 8.6090 #unique: 800000 #total: 20000000
Title: Re: Sorting and Counting
Post by: avk on July 21, 2019, 07:03:41 am
@VTwin, thanks.
It seems DualPivotQuicksort is quit good when the input array contains only a few repetitive values.

@ASerge, I'm not sure about rtl-generics, but LGenerics is definitely incompatible with FPC 3.0.4.

@julkas, I replaced the SortCountJulkas1 code with the one of your new version.

@ 440bx, I used your new WinSortCount3, but I turned off I/O checking.
Benchmark is compiled with a 32-bit compiler and runs on 64-bit Windows7.

code:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, garrayutils, gvector, gmap,
  12.   WinSortCount3;
  13.  
  14. type
  15.   TIntPair = TPair<LongInt, LongInt>;
  16.   TProcedureArray = array of procedure;
  17.  
  18.   function ComparePairs(constref L, R: TIntPair): LongInt;
  19.   begin
  20.     if L.Key < R.Key then
  21.       Result := -1
  22.     else if L.Key = R.Key then
  23.       Result := 0
  24.     else
  25.       Result := 1;
  26.   end;
  27.  
  28. var
  29.   Total, Unique, repeatCount, randomrange: Integer;
  30.   Start: TDateTime;
  31.   inFilename: String = 'data.txt';
  32.   outFilename: String = 'sorted.txt';
  33.   routineName: String;
  34.   procedures: TProcedureArray;
  35.   proc: procedure;
  36.  
  37.   procedure GenerateData(randomRange: Integer=8; repeatMillionsCount: Integer=2);
  38.   var
  39.     InFile: Text;
  40.     I: LongInt;
  41.   begin
  42.     Assign(InFile, inFilename);
  43.     Rewrite(InFile);
  44.     for I := 1 to repeatMillionsCount * 1000000 do
  45.       WriteLn(InFile, 1500000000 + Random(randomRange * 100000));
  46.     Close(InFile);
  47.   end;
  48.  
  49.   procedure SortCountAkira;
  50.   var
  51.     I: LongInt;
  52.     InOut: Text;
  53.     Map: TDictionary<LongInt, LongInt>;
  54.     Pair: TIntPair;
  55.     Pairs: TArray<TIntPair>;
  56.   begin
  57.     routineName := {$I %currentroutine%};
  58.     Map := TDictionary<LongInt, LongInt>.Create();
  59.     //Map.Capacity := 10000000;
  60.     Assign(InOut, inFilename);
  61.     Reset(InOut);
  62.     while not EOF(InOut) do begin
  63.       {$I-}ReadLn(InOut, I);{$I+}
  64.       Inc(Total);
  65.       if not Map.ContainsKey(I) then
  66.         begin
  67.           Map.Add(I, 1);
  68.           Inc(Unique);
  69.         end
  70.       else
  71.         Map[I] := Map[I] + 1;
  72.     end;
  73.     Close(InOut);
  74.     Pairs := Map.ToArray();
  75.     Map.Free();
  76.     TArrayHelper<TIntPair>.Sort(
  77.       Pairs,
  78.       TComparer<TIntPair>.Construct(ComparePairs)
  79.     );
  80.     Assign(InOut, outFilename);
  81.     Rewrite(InOut);
  82.     for Pair in Pairs do with Pair do
  83.       {$I-}WriteLn(InOut, Key, ' - ', Value);{$I+}
  84.     Close(InOut);
  85.   end;
  86.  
  87.   procedure SortCountHoward;
  88.   var
  89.     arr: array of Integer;
  90.     textf: TextFile;
  91.     min: Integer = High(Integer);
  92.     max: Integer = -1;
  93.     i: Integer;
  94.   begin
  95.     routineName := {$I %currentroutine%};
  96.     AssignFile(textf, inFilename);
  97.     Reset(textf);
  98.     while not EOF(textf) do
  99.       begin
  100.         {$I-}ReadLn(textf, i);{$I+}
  101.         Inc(Total);
  102.         if i < min then
  103.           min := i;
  104.         if i > max then
  105.           max := i;
  106.       end;
  107.     SetLength(arr, max-min+1);
  108.  
  109.     Reset(textf);
  110.     while not EOF(textf) do
  111.       begin
  112.         ReadLn(textf, i);
  113.         Dec(i, min);
  114.         Inc(arr[i]);
  115.       end;
  116.     CloseFile(textf);
  117.  
  118.     AssignFile(textf, outFilename);
  119.     Rewrite(textf);
  120.     for i := Low(arr) to High(arr) do
  121.       case (arr[i] > 0) of
  122.         True:
  123.           begin
  124.             {$I-}WriteLn(textf, i+min, ' - ', arr[i]); {$I+}
  125.             Inc(Unique);
  126.           end;
  127.       end;
  128.     CloseFile(textf);
  129.     SetLength(arr, 0);
  130.   end;
  131.  
  132.   procedure SortCountAvk1;
  133.   type
  134.     TCounter  = TGHashMultiSetLP<Integer>;
  135.     TCountRef = TGAutoRef<TCounter>;
  136.     TEntry    = TCounter.TEntry;
  137.  
  138.     function EntryCmp(constref L, R: TEntry): SizeInt;
  139.     begin
  140.       if L.Key > R.Key then
  141.         Result := 1
  142.       else
  143.         if L.Key < R.Key then
  144.           Result := -1
  145.         else
  146.           Result := 0;
  147.     end;
  148.  
  149.   var
  150.     CountRef: TCountRef;
  151.     InOut: Text;
  152.     Counter: TCounter;
  153.     e: TEntry;
  154.     I: Integer;
  155.   begin
  156.     routineName := {$I %currentroutine%};
  157.     Counter := CountRef;
  158.     Counter.LoadFactor := 0.7;
  159.     Assign(InOut, inFilename);
  160.     Reset(InOut);
  161.     while not EOF(InOut) do
  162.       begin
  163.         {$I-}ReadLn(InOut, I);{$I+}
  164.         Counter.Add(I);
  165.       end;
  166.     Close(InOut);
  167.     Total := Counter.Count;
  168.     Unique := Counter.EntryCount;
  169.     if Counter.NonEmpty then
  170.       begin
  171.         Assign(InOut, outFilename);
  172.         Rewrite(InOut);
  173.         for e in Counter.Entries.Sorted(EntryCmp) do
  174.           with e do
  175.             {$I-}WriteLn(InOut, Key, ' - ', Count);{$I+}
  176.         Close(InOut);
  177.       end;
  178.   end;
  179.  
  180.   procedure SortCountAvk2;
  181.   var
  182.     List: array of Integer;
  183.     InOut: Text;
  184.     I, J, Count, DupCount: Integer;
  185.   begin
  186.     routineName := {$I %currentroutine%};
  187.     Assign(InOut, inFilename);
  188.     Reset(InOut);
  189.     SetLength(List, 4096);
  190.     I := 0;
  191.     while not EOF(InOut) do
  192.       begin
  193.         {$I-}ReadLn(InOut, J);{$I+}
  194.         Inc(Total);
  195.         if Length(List) = I then
  196.           SetLength(List, I * 2);
  197.         List[I] := J;
  198.         Inc(I);
  199.       end;
  200.     Close(InOut);
  201.     SetLength(List, I);
  202.     if List = nil then
  203.       exit;
  204.     TGOrdinalArrayHelper<Integer>.Sort(List);
  205.     Count := I;
  206.     DupCount := 0;
  207.     I := 0;
  208.     Assign(InOut, outFilename);
  209.     Rewrite(InOut);
  210.     repeat
  211.       J := List[I];
  212.       while (I < Count) and (List[I] = J) do
  213.         begin
  214.           Inc(DupCount);
  215.           Inc(I);
  216.         end;
  217.       {$I-}WriteLn(InOut, J, ' - ', DupCount);{$I+}
  218.       Inc(Unique);
  219.       DupCount := 0;
  220.     until I = Count;
  221.     Close(InOut);
  222.   end;
  223.  
  224.   procedure SortCountJulkas1;
  225.   type
  226.     TIntLess = TLess<LongInt>;
  227.     TIntVect = TVector<LongInt>;
  228.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  229.   const
  230.     bsz = 1 shl 17;
  231.   var
  232.     sc: array[0..2147483647 shr 17] of TIntVect;
  233.     i: LongInt;
  234.     pkey, key, cnt: LongInt;
  235.     offset: LongInt;
  236.     InOut: Text;
  237.   begin
  238.     routineName := {$I %currentroutine%};
  239.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  240.  
  241.     Assign(InOut, inFilename);
  242.     Reset(InOut);
  243.     while not EOF(InOut) do
  244.       begin
  245.         {$I-}ReadLn(InOut, key);{$I+}
  246.         Inc(Total);
  247.         sc[key shr 17].PushBack(key and $1FFFF);
  248.       end;
  249.     Close(InOut);
  250.  
  251.     Assign(InOut, outFilename);
  252.     Rewrite(InOut);
  253.  
  254.     offset := -bsz;
  255.     for i := Low(sc) to High(sc) do
  256.     begin
  257.       Inc(offset, bsz);
  258.       pkey := -1;
  259.       cnt := 0;
  260.       if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  261.       for key in sc[i] do
  262.       begin
  263.         if pkey <> key then
  264.         begin
  265.           if cnt <> 0 then
  266.           begin
  267.             {$I-}WriteLn(InOut, offset + pkey, ' - ', cnt);{$I+}
  268.             Inc(Unique);
  269.           end;
  270.           pkey := key;
  271.           cnt := 0;
  272.         end;
  273.         Inc(cnt);
  274.       end;
  275.       if cnt <> 0 then
  276.       begin
  277.         {$I-}WriteLn(InOut, offset + pkey, ' - ', cnt);{$I+}
  278.         Inc(Unique);
  279.       end;
  280.     end;
  281.  
  282.     Close(InOut);
  283.     for i := Low(sc) to High(sc) do sc[i].Destroy;
  284.   end;
  285.  
  286.   procedure SortCountJulkas2;
  287.   type
  288.     TIntLess = TLess<LongInt>;
  289.     TIntVect = TVector<LongInt>;
  290.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  291.   var
  292.     sc: array[0..21474] of TIntVect;
  293.     i: LongInt;
  294.     pkey, key, cnt: LongInt;
  295.     offset: LongInt;
  296.     InOut: Text;
  297.   begin
  298.     routineName := {$I %currentroutine%};
  299.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  300.  
  301.     Assign(InOut, inFilename);
  302.     Reset(InOut);
  303.     while not EOF(InOut) do
  304.       begin
  305.         {$I-}ReadLn(InOut, key);{$I+}
  306.         Inc(Total);
  307.         sc[key div 100000].PushBack(key mod 100000);
  308.       end;
  309.     Close(InOut);
  310.  
  311.     Assign(InOut, outFilename);
  312.     Rewrite(InOut);
  313.  
  314.     for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  315.  
  316.     offset := -100000;
  317.     for i := Low(sc) to High(sc) do
  318.     begin
  319.       Inc(offset, 100000);
  320.       pkey := -1;
  321.       cnt := 0;
  322.       for key in sc[i] do
  323.       begin
  324.         if pkey <> key then
  325.         begin
  326.           if cnt <> 0 then
  327.           begin
  328.             {$I-}WriteLn(InOut, offset + pkey, ' - ', cnt);{$I+}
  329.             Inc(Unique);
  330.           end;
  331.           pkey := key;
  332.           cnt := 0;
  333.         end;
  334.         Inc(cnt);
  335.       end;
  336.       if cnt <> 0 then
  337.       begin
  338.         {$I-}WriteLn(InOut, offset + pkey, ' - ', cnt);{$I+}
  339.         Inc(Unique);
  340.       end;
  341.     end;
  342.  
  343.     Close(InOut);
  344.     for i := Low(sc) to High(sc) do
  345.       sc[i].Destroy;
  346.   end;
  347.  
  348.   procedure SortCount440bx;
  349.   begin
  350.     routineName := {$I %currentroutine%};
  351.     WinSortCount3.DataCount := 0;
  352.     WinSortCount3.Unique := 0;
  353.     WinSortCount3.InFileName := inFilename;
  354.     WinSortCount3.OutFileName := outFilename;
  355.     WinSortCount3.SortCount;
  356.     Total := WinSortCount3.DataCount;
  357.     Unique := WinSortCount3.Unique;
  358.   end;
  359.  
  360.   procedure Run(aProc: TProcedure);
  361.   begin
  362.     Total := 0;
  363.     Unique := 0;
  364.     Start := Now;
  365.     try
  366.       aProc();
  367.       WriteLn(Copy(routineName, 10, 20):7,'''s time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4,' #unique: ',Unique,' #total: ',Total);
  368.     except
  369.       on e: Exception do
  370.         WriteLn('crashes with message "', e.Message, '"');
  371.     end;
  372.   end;
  373.  
  374. begin
  375.   Randomize;
  376.  
  377.   procedures := TProcedureArray.Create(
  378.                    SortCountJulkas1, SortCountJulkas2, SortCountAkira, SortCountHoward, SortCountAvk1,
  379.                    SortCountAvk2, SortCount440bx);
  380.  
  381.   for randomrange := 1 to 10 do
  382.     begin
  383.       GenerateData(randomrange);
  384.       WriteLn(#10'RandomRange = ',randomrange);
  385.       for proc in procedures do
  386.         Run(proc);
  387.     end;
  388.  
  389.   for repeatCount := 1 to 10 do
  390.     begin
  391.       GenerateData(8, 2*repeatCount);
  392.       WriteLn(#10'repeatMillionsCount = ', 2*repeatCount);
  393.       for proc in procedures do
  394.         Run(proc);
  395.     end;
  396. end.
  397.  

output:
Code: Text  [Select][+][-]
  1.  
  2. RandomRange = 1
  3. Julkas1's time: 0.7800 #unique: 100000 #total: 2000000
  4. Julkas2's time: 0.7650 #unique: 100000 #total: 2000000
  5.   Akira's time: 0.7170 #unique: 100000 #total: 2000000
  6.  Howard's time: 0.9830 #unique: 100000 #total: 2000000
  7.    Avk1's time: 0.6400 #unique: 100000 #total: 2000000
  8.    Avk2's time: 0.4990 #unique: 100000 #total: 2000000
  9.   440bx's time: 0.4990 #unique: 100000 #total: 2000000
  10.  
  11. RandomRange = 2
  12. Julkas1's time: 0.7800 #unique: 199994 #total: 2000000
  13. Julkas2's time: 0.8270 #unique: 199994 #total: 2000000
  14.   Akira's time: 0.8730 #unique: 199994 #total: 2000000
  15.  Howard's time: 1.0450 #unique: 199994 #total: 2000000
  16.    Avk1's time: 0.7650 #unique: 199994 #total: 2000000
  17.    Avk2's time: 0.5300 #unique: 199994 #total: 2000000
  18.   440bx's time: 0.5460 #unique: 199994 #total: 2000000
  19.  
  20. RandomRange = 3
  21. Julkas1's time: 0.8270 #unique: 299625 #total: 2000000
  22. Julkas2's time: 0.8210 #unique: 299625 #total: 2000000
  23.   Akira's time: 0.9750 #unique: 299625 #total: 2000000
  24.  Howard's time: 1.0570 #unique: 299625 #total: 2000000
  25.    Avk1's time: 0.7830 #unique: 299625 #total: 2000000
  26.    Avk2's time: 0.5650 #unique: 299625 #total: 2000000
  27.   440bx's time: 0.5920 #unique: 299625 #total: 2000000
  28.  
  29. RandomRange = 4
  30. Julkas1's time: 0.8330 #unique: 397259 #total: 2000000
  31. Julkas2's time: 0.8300 #unique: 397259 #total: 2000000
  32.   Akira's time: 1.0660 #unique: 397259 #total: 2000000
  33.  Howard's time: 1.0970 #unique: 397259 #total: 2000000
  34.    Avk1's time: 0.8480 #unique: 397259 #total: 2000000
  35.    Avk2's time: 0.5800 #unique: 397259 #total: 2000000
  36.   440bx's time: 0.6270 #unique: 397259 #total: 2000000
  37.  
  38. RandomRange = 5
  39. Julkas1's time: 0.8630 #unique: 490899 #total: 2000000
  40. Julkas2's time: 0.8600 #unique: 490899 #total: 2000000
  41.   Akira's time: 1.1640 #unique: 490899 #total: 2000000
  42.  Howard's time: 1.1590 #unique: 490899 #total: 2000000
  43.    Avk1's time: 0.9890 #unique: 490899 #total: 2000000
  44.    Avk2's time: 0.6720 #unique: 490899 #total: 2000000
  45.   440bx's time: 0.6890 #unique: 490899 #total: 2000000
  46.  
  47. RandomRange = 6
  48. Julkas1's time: 0.9110 #unique: 578653 #total: 2000000
  49. Julkas2's time: 0.9180 #unique: 578653 #total: 2000000
  50.   Akira's time: 1.3110 #unique: 578653 #total: 2000000
  51.  Howard's time: 1.2000 #unique: 578653 #total: 2000000
  52.    Avk1's time: 0.9590 #unique: 578653 #total: 2000000
  53.    Avk2's time: 0.6510 #unique: 578653 #total: 2000000
  54.   440bx's time: 0.6900 #unique: 578653 #total: 2000000
  55.  
  56. RandomRange = 7
  57. Julkas1's time: 0.9810 #unique: 659394 #total: 2000000
  58. Julkas2's time: 0.9070 #unique: 659394 #total: 2000000
  59.   Akira's time: 1.3500 #unique: 659394 #total: 2000000
  60.  Howard's time: 1.2490 #unique: 659394 #total: 2000000
  61.    Avk1's time: 0.9820 #unique: 659394 #total: 2000000
  62.    Avk2's time: 0.7210 #unique: 659394 #total: 2000000
  63.   440bx's time: 0.6840 #unique: 659394 #total: 2000000
  64.  
  65. RandomRange = 8
  66. Julkas1's time: 0.9530 #unique: 734160 #total: 2000000
  67. Julkas2's time: 0.9200 #unique: 734160 #total: 2000000
  68.   Akira's time: 1.4140 #unique: 734160 #total: 2000000
  69.  Howard's time: 1.3470 #unique: 734160 #total: 2000000
  70.    Avk1's time: 1.0500 #unique: 734160 #total: 2000000
  71.    Avk2's time: 0.7050 #unique: 734160 #total: 2000000
  72.   440bx's time: 0.6900 #unique: 734160 #total: 2000000
  73.  
  74. RandomRange = 9
  75. Julkas1's time: 0.9410 #unique: 802680 #total: 2000000
  76. Julkas2's time: 0.9270 #unique: 802680 #total: 2000000
  77.   Akira's time: 1.5210 #unique: 802680 #total: 2000000
  78.  Howard's time: 1.2700 #unique: 802680 #total: 2000000
  79.    Avk1's time: 1.0660 #unique: 802680 #total: 2000000
  80.    Avk2's time: 0.7350 #unique: 802680 #total: 2000000
  81.   440bx's time: 0.7080 #unique: 802680 #total: 2000000
  82.  
  83. RandomRange = 10
  84. Julkas1's time: 0.9380 #unique: 864917 #total: 2000000
  85. Julkas2's time: 0.9580 #unique: 864917 #total: 2000000
  86.   Akira's time: 1.5790 #unique: 864917 #total: 2000000
  87.  Howard's time: 1.3090 #unique: 864917 #total: 2000000
  88.    Avk1's time: 1.1260 #unique: 864917 #total: 2000000
  89.    Avk2's time: 0.7490 #unique: 864917 #total: 2000000
  90.   440bx's time: 0.7210 #unique: 864917 #total: 2000000
  91.  
  92. repeatMillionsCount = 2
  93. Julkas1's time: 0.9690 #unique: 734280 #total: 2000000
  94. Julkas2's time: 0.9640 #unique: 734280 #total: 2000000
  95.   Akira's time: 1.4350 #unique: 734280 #total: 2000000
  96.  Howard's time: 1.3070 #unique: 734280 #total: 2000000
  97.    Avk1's time: 1.0430 #unique: 734280 #total: 2000000
  98.    Avk2's time: 0.7050 #unique: 734280 #total: 2000000
  99.   440bx's time: 0.6900 #unique: 734280 #total: 2000000
  100.  
  101. repeatMillionsCount = 4
  102. Julkas1's time: 1.6610 #unique: 794661 #total: 4000000
  103. Julkas2's time: 1.6720 #unique: 794661 #total: 4000000
  104.   Akira's time: 2.2260 #unique: 794661 #total: 4000000
  105.  Howard's time: 2.4500 #unique: 794661 #total: 4000000
  106.    Avk1's time: 1.7420 #unique: 794661 #total: 4000000
  107.    Avk2's time: 1.1830 #unique: 794661 #total: 4000000
  108.   440bx's time: 1.2800 #unique: 794661 #total: 4000000
  109.  
  110. repeatMillionsCount = 6
  111. Julkas1's time: 2.3830 #unique: 799556 #total: 6000000
  112. Julkas2's time: 2.3840 #unique: 799556 #total: 6000000
  113.   Akira's time: 2.9000 #unique: 799556 #total: 6000000
  114.  Howard's time: 3.2960 #unique: 799556 #total: 6000000
  115.    Avk1's time: 2.3930 #unique: 799556 #total: 6000000
  116.    Avk2's time: 1.6810 #unique: 799556 #total: 6000000
  117.   440bx's time: 1.8210 #unique: 799556 #total: 6000000
  118.  
  119. repeatMillionsCount = 8
  120. Julkas1's time: 3.1210 #unique: 799967 #total: 8000000
  121. Julkas2's time: 3.1040 #unique: 799967 #total: 8000000
  122.   Akira's time: 3.6550 #unique: 799967 #total: 8000000
  123.  Howard's time: 4.3260 #unique: 799967 #total: 8000000
  124.    Avk1's time: 3.0250 #unique: 799967 #total: 8000000
  125.    Avk2's time: 2.1500 #unique: 799967 #total: 8000000
  126.   440bx's time: 2.3370 #unique: 799967 #total: 8000000
  127.  
  128. repeatMillionsCount = 10
  129. Julkas1's time: 3.8410 #unique: 799997 #total: 10000000
  130. Julkas2's time: 3.8310 #unique: 799997 #total: 10000000
  131.   Akira's time: 4.3570 #unique: 799997 #total: 10000000
  132.  Howard's time: 5.3620 #unique: 799997 #total: 10000000
  133.    Avk1's time: 3.6840 #unique: 799997 #total: 10000000
  134.    Avk2's time: 2.6500 #unique: 799997 #total: 10000000
  135.   440bx's time: 2.8860 #unique: 799997 #total: 10000000
  136.  
  137. repeatMillionsCount = 12
  138. Julkas1's time: 4.5800 #unique: 799999 #total: 12000000
  139. Julkas2's time: 4.5300 #unique: 799999 #total: 12000000
  140.   Akira's time: 5.1050 #unique: 799999 #total: 12000000
  141.  Howard's time: 6.3740 #unique: 799999 #total: 12000000
  142.    Avk1's time: 4.4770 #unique: 799999 #total: 12000000
  143.    Avk2's time: 3.2540 #unique: 799999 #total: 12000000
  144.   440bx's time: 3.4680 #unique: 799999 #total: 12000000
  145.  
  146. repeatMillionsCount = 14
  147. Julkas1's time: 5.3170 #unique: 800000 #total: 14000000
  148. Julkas2's time: 5.3510 #unique: 800000 #total: 14000000
  149.   Akira's time: 6.1320 #unique: 800000 #total: 14000000
  150.  Howard's time: 7.5900 #unique: 800000 #total: 14000000
  151.    Avk1's time: 5.2320 #unique: 800000 #total: 14000000
  152.    Avk2's time: 3.5410 #unique: 800000 #total: 14000000
  153.   440bx's time: 3.9310 #unique: 800000 #total: 14000000
  154.  
  155. repeatMillionsCount = 16
  156. Julkas1's time: 6.4290 #unique: 800000 #total: 16000000
  157. Julkas2's time: 6.0400 #unique: 800000 #total: 16000000
  158.   Akira's time: 6.5250 #unique: 800000 #total: 16000000
  159.  Howard's time: 8.4090 #unique: 800000 #total: 16000000
  160.    Avk1's time: 5.5970 #unique: 800000 #total: 16000000
  161.    Avk2's time: 3.9940 #unique: 800000 #total: 16000000
  162.   440bx's time: 4.4460 #unique: 800000 #total: 16000000
  163.  
  164. repeatMillionsCount = 18
  165. Julkas1's time: 6.7460 #unique: 800000 #total: 18000000
  166. Julkas2's time: 6.7230 #unique: 800000 #total: 18000000
  167.   Akira's time: 7.2390 #unique: 800000 #total: 18000000
  168.  Howard's time: 9.3290 #unique: 800000 #total: 18000000
  169.    Avk1's time: 6.4050 #unique: 800000 #total: 18000000
  170.    Avk2's time: 4.8370 #unique: 800000 #total: 18000000
  171.   440bx's time: 5.3630 #unique: 800000 #total: 18000000
  172.  
  173. repeatMillionsCount = 20
  174. Julkas1's time: 7.6870 #unique: 800000 #total: 20000000
  175. Julkas2's time: 7.5800 #unique: 800000 #total: 20000000
  176.   Akira's time: 8.2640 #unique: 800000 #total: 20000000
  177.  Howard's time: 10.4280 #unique: 800000 #total: 20000000
  178.    Avk1's time: 6.8850 #unique: 800000 #total: 20000000
  179.    Avk2's time: 4.9860 #unique: 800000 #total: 20000000
  180.   440bx's time: 5.5030 #unique: 800000 #total: 20000000
  181.  
Title: Re: Sorting and Counting
Post by: jamie on July 21, 2019, 06:54:42 pm
I would like to get in on this adventure myself  :D

However, you guys wouldn't like my version because it would involve using Assembler code.  >:D
Title: Re: Sorting and Counting
Post by: julkas on July 21, 2019, 07:17:42 pm
I would like to get in on this adventure myself  :D

However, you guys wouldn't like my version because it would involve using Assembler code.  >:D
Go on. You are welcome.  ;)
Title: Re: Sorting and Counting
Post by: 440bx on July 21, 2019, 08:50:36 pm
Benchmark is compiled with a 32-bit compiler and runs on 64-bit Windows7.
Thank you. 

If it isn't too much trouble, I'd like to see the results when compiled for 64bit. 
Title: Re: Sorting and Counting
Post by: VTwin on July 21, 2019, 10:06:38 pm
I would like to get in on this adventure myself  :D

However, you guys wouldn't like my version because it would involve using Assembler code.  >:D

I'd be interested, as long as it is simple and cross-platform. :D
Title: Re: Sorting and Counting
Post by: ASerge on July 22, 2019, 12:04:16 am
@ASerge, I'm not sure about rtl-generics, but LGenerics is definitely incompatible with FPC 3.0.4.
OK. What about project with FPC 3.3.1?
Title: Re: Sorting and Counting
Post by: avk on July 22, 2019, 02:44:36 am
OK. What about project with FPC 3.3.1?
And what's wrong with the project for 3.3.1?

@ 440BX
64-bit version, it seems you won  :):
Code: Text  [Select][+][-]
  1. RandomRange = 5
  2. Julkas1's time: 0.7330  #unique: 490717 #total: 2000000
  3. Julkas2's time: 0.7490  #unique: 490717 #total: 2000000
  4.   Akira's time: 0.9050  #unique: 490717 #total: 2000000
  5.  Howard's time: 1.1230  #unique: 490717 #total: 2000000
  6.    Avk1's time: 0.8420  #unique: 490717 #total: 2000000
  7.    Avk2's time: 0.5780  #unique: 490717 #total: 2000000
  8.   440bx's time: 0.4520  #unique: 490717 #total: 2000000
  9.  
  10. RandomRange = 6
  11. Julkas1's time: 0.7640  #unique: 578502 #total: 2000000
  12. Julkas2's time: 0.7490  #unique: 578502 #total: 2000000
  13.   Akira's time: 0.9510  #unique: 578502 #total: 2000000
  14.  Howard's time: 1.0770  #unique: 578502 #total: 2000000
  15.    Avk1's time: 0.8890  #unique: 578502 #total: 2000000
  16.    Avk2's time: 0.5930  #unique: 578502 #total: 2000000
  17.   440bx's time: 0.5150  #unique: 578502 #total: 2000000
  18.  
  19. RandomRange = 7
  20. Julkas1's time: 0.7800  #unique: 659891 #total: 2000000
  21. Julkas2's time: 0.7800  #unique: 659891 #total: 2000000
  22.   Akira's time: 0.9830  #unique: 659891 #total: 2000000
  23.  Howard's time: 1.1890  #unique: 659891 #total: 2000000
  24.    Avk1's time: 0.9530  #unique: 659891 #total: 2000000
  25.    Avk2's time: 0.6490  #unique: 659891 #total: 2000000
  26.   440bx's time: 0.4960  #unique: 659891 #total: 2000000
  27.  
  28. RandomRange = 8
  29. Julkas1's time: 0.7990  #unique: 734164 #total: 2000000
  30. Julkas2's time: 0.8200  #unique: 734164 #total: 2000000
  31.   Akira's time: 1.0450  #unique: 734164 #total: 2000000
  32.  Howard's time: 1.2610  #unique: 734164 #total: 2000000
  33.    Avk1's time: 1.1300  #unique: 734164 #total: 2000000
  34.    Avk2's time: 0.6810  #unique: 734164 #total: 2000000
  35.   440bx's time: 0.4940  #unique: 734164 #total: 2000000
  36.  
  37. RandomRange = 9
  38. Julkas1's time: 0.9130  #unique: 802348 #total: 2000000
  39. Julkas2's time: 1.0760  #unique: 802348 #total: 2000000
  40.   Akira's time: 1.3740  #unique: 802348 #total: 2000000
  41.  Howard's time: 1.2440  #unique: 802348 #total: 2000000
  42.    Avk1's time: 1.0330  #unique: 802348 #total: 2000000
  43.    Avk2's time: 0.6590  #unique: 802348 #total: 2000000
  44.   440bx's time: 0.5400  #unique: 802348 #total: 2000000
  45.  
  46. RandomRange = 10
  47. Julkas1's time: 0.8330  #unique: 864249 #total: 2000000
  48. Julkas2's time: 0.8140  #unique: 864249 #total: 2000000
  49.   Akira's time: 1.1120  #unique: 864249 #total: 2000000
  50.  Howard's time: 1.1760  #unique: 864249 #total: 2000000
  51.    Avk1's time: 1.0670  #unique: 864249 #total: 2000000
  52.    Avk2's time: 0.6760  #unique: 864249 #total: 2000000
  53.   440bx's time: 0.5330  #unique: 864249 #total: 2000000
  54.  
  55. repeatMillionsCount = 10
  56. Julkas1's time: 4.4350  #unique: 5709324 #total: 10000000
  57. Julkas2's time: 4.4840  #unique: 5709324 #total: 10000000
  58.   Akira's time: 6.5020  #unique: 5709324 #total: 10000000
  59.  Howard's time: 6.8220  #unique: 5709324 #total: 10000000
  60.    Avk1's time: 6.0550  #unique: 5709324 #total: 10000000
  61.    Avk2's time: 4.4340  #unique: 5709324 #total: 10000000
  62.   440bx's time: 3.8650  #unique: 5709324 #total: 10000000
  63.  
  64. repeatMillionsCount = 12
  65. Julkas1's time: 5.2520  #unique: 6216581 #total: 12000000
  66. Julkas2's time: 5.3630  #unique: 6216581 #total: 12000000
  67.   Akira's time: 7.6030  #unique: 6216581 #total: 12000000
  68.  Howard's time: 7.6950  #unique: 6216581 #total: 12000000
  69.    Avk1's time: 7.4680  #unique: 6216581 #total: 12000000
  70.    Avk2's time: 5.1700  #unique: 6216581 #total: 12000000
  71.   440bx's time: 4.4730  #unique: 6216581 #total: 12000000
  72.  
  73. repeatMillionsCount = 14
  74. Julkas1's time: 5.9630  #unique: 6609319 #total: 14000000
  75. Julkas2's time: 5.9450  #unique: 6609319 #total: 14000000
  76.   Akira's time: 8.7800  #unique: 6609319 #total: 14000000
  77.  Howard's time: 8.9000  #unique: 6609319 #total: 14000000
  78.    Avk1's time: 8.1640  #unique: 6609319 #total: 14000000
  79.    Avk2's time: 5.8020  #unique: 6609319 #total: 14000000
  80.   440bx's time: 5.0030  #unique: 6609319 #total: 14000000
  81.  
  82. repeatMillionsCount = 16
  83. Julkas1's time: 6.6950  #unique: 6917359 #total: 16000000
  84. Julkas2's time: 6.6010  #unique: 6917359 #total: 16000000
  85.   Akira's time: 9.5630  #unique: 6917359 #total: 16000000
  86.  Howard's time: 9.8050  #unique: 6917359 #total: 16000000
  87.    Avk1's time: 8.9960  #unique: 6917359 #total: 16000000
  88.    Avk2's time: 6.3830  #unique: 6917359 #total: 16000000
  89.   440bx's time: 5.5600  #unique: 6917359 #total: 16000000
  90.  
  91. repeatMillionsCount = 18
  92. Julkas1's time: 7.2430  #unique: 7157162 #total: 18000000
  93. Julkas2's time: 7.2440  #unique: 7157162 #total: 18000000
  94.   Akira's time: 10.1980 #unique: 7157162 #total: 18000000
  95.  Howard's time: 11.2310 #unique: 7157162 #total: 18000000
  96.    Avk1's time: 9.6480  #unique: 7157162 #total: 18000000
  97.    Avk2's time: 7.1490  #unique: 7157162 #total: 18000000
  98.   440bx's time: 5.9770  #unique: 7157162 #total: 18000000
  99.  
  100. repeatMillionsCount = 20
  101. Julkas1's time: 8.1910  #unique: 7343071 #total: 20000000
  102. Julkas2's time: 7.9810  #unique: 7343071 #total: 20000000
  103.   Akira's time: 11.0300 #unique: 7343071 #total: 20000000
  104.  Howard's time: 12.1460 #unique: 7343071 #total: 20000000
  105.    Avk1's time: 10.5860 #unique: 7343071 #total: 20000000
  106.    Avk2's time: 7.9840  #unique: 7343071 #total: 20000000
  107.   440bx's time: 6.5130  #unique: 7343071 #total: 20000000
  108.  

Please excuse me, but for some time I will not be able to attend the forum.
Title: Re: Sorting and Counting
Post by: 440bx on July 22, 2019, 03:04:54 am
@ 440BX
64-bit version, it seems you won  :):
but only for 64bit.  In 32bit, you win :). The cost of pushing and popping parameters is just too high.
Title: Re: Sorting and Counting
Post by: ASerge on July 22, 2019, 12:09:02 pm
And what's wrong with the project for 3.3.1?
Attach it , please.
Title: Re: Sorting and Counting
Post by: julkas on July 23, 2019, 02:57:05 pm
I replaced fcl-stl TVector with generics.collections TList in my algo. I don't know why TList gives very poor performance.
So my fcl-stl TVector solution is better than Akira's generics.collections TDictionary and better than generics.collections TList.
Strange ... Who can explain this ? :-(

Also generics.collections is way faster and more modern ...
Is it true ?
Title: Re: Sorting and Counting
Post by: Thaddy on July 23, 2019, 06:58:52 pm
Usually, yes, but mind the remarks I made in the other thread: it is about some internals. In this thread somebody used those low-level adjustments. See if you can spot who did... :-X
The uses clause is usually a dead giveaway.... Analyze the inheritance...
Title: Re: Sorting and Counting
Post by: julkas on July 23, 2019, 08:32:25 pm
Usually, yes, but mind the remarks I made in the other thread: it is about some internals. In this thread somebody used those low-level adjustments. See if you can spot who did... :-X
The uses clause is usually a dead giveaway.... Analyze the inheritance...
In this case I don't want low-level tricks, fast I/O, ... I want short, clean and fast as possible solution with well known data structures (classes) and algorithms. So I try understand pros and cons of different Pascal generics implementation.
In real life I can't use even fcl-stl. 80% of my Pascal code (DS, algorithm) is written from scratch.
Title: Re: Sorting and Counting
Post by: 440bx on July 23, 2019, 09:10:19 pm
I want short, clean and fast as possible solution with well known data structures (classes) and algorithms.
You can legitimately say that about the algorithms you used to solve the problem, I _cannot_ say that about the implementations I presented, I traded cleanliness for speed. 

At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest.  Its downside is, in some cases, it can take a lot more memory than desirable.

When everything is taken into account, I believe Avk implementation number 2 is the best one. 
Title: Re: Sorting and Counting
Post by: VTwin on July 23, 2019, 09:43:56 pm
In this case I don't want low-level tricks, fast I/O, ... I want short, clean and fast as possible solution with well known data structures (classes) and algorithms. So I try understand pros and cons of different Pascal generics implementation.
In real life I can't use even fcl-stl. 80% of my Pascal code (DS, algorithm) is written from scratch.

I don't have (ready) access to a generics library, so I just replaced one line of code in Avk2. Not a speed record or new suggestion, but it runs pretty quick. I believe Avk2 uses an Introsort.

Code: Pascal  [Select][+][-]
  1. procedure Quicksort(var a: IVector; left, right: integer);
  2. var
  3.   i, j: integer;
  4.   x: integer;
  5. begin
  6.   if (right <= left) or (left < 0) then
  7.     exit;
  8.   if (right - left + 1) < MinQSortElem then begin
  9.     Insertionsort(a, left, right);
  10.     exit;
  11.   end;
  12.   i := left;
  13.   j := right;
  14.   x := a[(left + right) div 2];
  15.   repeat
  16.     while (Compare(a[i], x) = -1) do
  17.       i := i + 1;
  18.     while (Compare(x, a[j]) = -1) do
  19.       j := j - 1;
  20.     if i <= j then begin
  21.       Swap(a[i], a[j]);
  22.       i := i + 1;
  23.       j := j - 1;
  24.     end;
  25.   until i > j;
  26.   if left < j then
  27.     Quicksort(a, left, j);
  28.   if i < right then
  29.     Quicksort(a, i, right);
  30. end;  

Of course the Insertionsort is unnecessary here, and Compare can be replaced with "<".
Title: Re: Sorting and Counting
Post by: ASerge on July 23, 2019, 09:59:13 pm
At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest. 
+1
That's why the results seem strange to me, because Howard's algorithm is O(n), and algorithms with sorts is O(n*lg(n)). But it is impossible to check, @avk did not provide the project.
Title: Re: Sorting and Counting
Post by: howardpc on July 23, 2019, 11:30:03 pm
I want short, clean and fast as possible solution with well known data structures (classes) and algorithms.
At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest.  Its downside is, in some cases, it can take a lot more memory than desirable.
Replacing the Format() in my original implementation with a simple Writeln() gives a slight speed increase.
However my implementation has two drawbacks apart from initially high memory usage:
Title: Re: Sorting and Counting
Post by: 440bx on July 23, 2019, 11:30:22 pm
That's why the results seem strange to me, because Howard's algorithm is O(n), and algorithms with sorts is O(n*lg(n)). But it is impossible to check, @avk did not provide the project.
They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.

Here are some of the costs his algorithm incurs, some of which could be avoided and others simply can't.
Code: Pascal  [Select][+][-]
  1.     min: Integer = High(Integer);
  2.     max: Integer = -1;
  3.   begin
  4.    routineName := {$I %currentroutine%};
  5.    AssignFile(textf, inFilename);
  6.    Reset(textf);
  7.    while not EOF(textf) do
  8.     begin
  9.        ReadLn(textf, i);
  10.         Inc(Total);
  11.         if i < min then
  12.           min := i;
  13.         if i > max then
  14.           max := i;
  15.      end;
  16.  
He has to walk the entire data file to determine min and max.  If in the above snippet of code, both min and max had been initialized to the first element in the data file then it would not be necessary to always do two (2) comparisons (one against min and another one against max.)  It could have been coded instead as
Code: Pascal  [Select][+][-]
  1.         if i < min then
  2.         begin
  3.           min := i;
  4.           continue;
  5.         end;
  6.         if i > max then
  7.           max := i;
  8.  
Thereby, for elements lower than the current min, avoiding a now unnecessary second comparison against max.  The gains would, of course, depend on the file's data arrangement.

The instruction
Code: Pascal  [Select][+][-]
  1.     SetLength(arr, max-min+1);  
is very expensive because it has to set (max - min + 1) number of elements - which, due to the large number of duplicates in the data file and their large range, will be significantly greater than the number of elements in the data file - to zero. 

Directly related to the above,
Code: Pascal  [Select][+][-]
  1.       case (arr[i] > 0) of
  2.         True:
  3.           begin
  4.             WriteLn(textf, i+min, ' - ', arr[i]);
  5.             Inc(Unique);
  6.           end;
  7.       end;
  8.  
The number of comparisons needed to "weed out" superfluous buckets (=0) is large (due to the large min, max range) and the large number of duplicates.

The larger number of comparisons due to the data range and the need to zero out a large memory area, is enough to nullify the gains of his O(n) algorithm.


ETA:

@Howard

Our posts crossed.  You are absolutely right.
Title: Re: Sorting and Counting
Post by: ASerge on July 23, 2019, 11:45:06 pm
They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.
But for large n it's better. And the algorithm can be improved. Since memory is still allocated a lot and we know the data format, it is better to set the minimum and maximum as constants.
Title: Re: Sorting and Counting
Post by: 440bx on July 24, 2019, 12:05:14 am
They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.
But for large n it's better.
Yes, provided that the range is reasonably close to n.  As the ratio of range/n increases, a radix sort suffers.

And the algorithm can be improved. Since memory is still allocated a lot and we know the data format, it is better to set the minimum and maximum as constants.
True, the concern is, once the algorithm uses knowledge about the data format it didn't determine itself, the algorithm implementation may lose generality.
Title: Re: Sorting and Counting
Post by: avk on July 31, 2019, 04:18:25 am
I replaced fcl-stl TVector with generics.collections TList in my algo. I don't know why TList gives very poor performance.
Just curious how much performance has degraded?

... I believe Avk2 uses an Introsort.
All LGArrayHelpers sorting algorithms are hybrid, in particular TGOrdinalArrayHelper.Sort tries to use
Counting Sort, if possible, otherwise uses Introsort. 

... But it is impossible to check, @avk did not provide the project.
Let me guess, the project of which you always mention this is LGenerics? If so, see attachment.
Title: Re: Sorting and Counting
Post by: avk on July 31, 2019, 06:23:04 am
Just in case, I decided to check the coincidence of the results of the existing solutions.
A curious fact, all coincide, except SortCount440bx.

...So my fcl-stl TVector solution is better than Akira's generics.collections TDictionary...
This is highly dependent on the input data.
Title: Re: Sorting and Counting
Post by: 440bx on July 31, 2019, 06:44:57 am
A curious fact, all coincide, except SortCount440bx.
that's because of the compare function.  The number of instances of each number is correct but, the collating sequence is different than what is obtained in a numerical comparison.
Title: Re: Sorting and Counting
Post by: avk on July 31, 2019, 06:56:08 am
@440bx, thank you, now I understand the reason.
Title: Re: Sorting and Counting
Post by: hnb on July 31, 2019, 09:13:58 am
Hello,

First : rtl-generics TDictionary seems like strange choice. rtl-generics has better collection for such purpose (sort + hash map). IMO should be used TSortedHashSet<T> with custom AComparer and AEqualityComparer, with using Capacity property.

Secondly: @Akira version for rtl-generics TDictionary is not optimized and has many redundant calls (also in second @avk test the Capacity is disabled which can decrease performance):


Code: Pascal  [Select][+][-]
  1.       if not Map.ContainsKey(I) then
  2.         begin
  3.           Map.Add(I, 1);
  4.           Inc(Unique);
  5.         end
  6.       else
  7.         Map[I] := Map[I] + 1;
  8.  

such code can be optimized, it can be reduced to something like this:

Code: Pascal  [Select][+][-]
  1. var
  2.   Counters: array of LongInt;
  3.   P: PLongInt;
  4.   Map: TDictionary<LongInt, PLongInt>;
  5.  
  6. ...
  7.  
  8.       if not Map.TryGetValue(I, P) then
  9.         begin
  10.           P := @Counters[Unique];
  11.           Map.Add(I, P);
  12.           Inc(Unique);
  13.         end;
  14.       Inc(P^);
  15.  

Above code will be much faster. The "reads/lookups" from dictionary is reduced to one, or to two during adding instead of 3 and 4 "reads/lookups".

Thirdly: for many reads from dictionary worth to test other dictionary from rtl-generics : TCuckooD2<TKey, TValue> / TFastHashMap<TKey, TValue>. Cuckoo should be faster for such case.
Title: Re: Sorting and Counting
Post by: julkas on July 31, 2019, 09:38:50 am
I replaced fcl-stl TVector with generics.collections TList in my algo. I don't know why TList gives very poor performance.
Just curious how much performance has degraded?
Hello @avk, @hnb ! (See - https://forum.lazarus.freepascal.org/index.php/topic,46254.0.html)
TList Code
Code: Pascal  [Select][+][-]
  1. program sc2;
  2. {$mode delphi}
  3.  
  4. uses SysUtils, Classes,
  5.     //gvector, gutil, garrayutils;
  6.     Generics.Defaults, Generics.Collections;
  7. const
  8.   keyNum = 10000000;
  9.   blckSize = 100000;
  10. //type
  11.   //TIntLess = TLess<LongInt>;
  12.   //TIntVect = TVector<LongInt>;
  13.   //TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  14.  
  15. var
  16.   //sc: array[0..2147483647 div blckSize] of TIntVect;
  17.   sc: array[0..2147483647 div blckSize] of Generics.Collections.TList<LongInt>;
  18.   i: LongInt;
  19.   pkey, key, cnt, uniq: LongInt;
  20.   offset: LongInt;
  21.   start: QWord;
  22.   outFile: Text;
  23.  
  24. begin
  25.   start := GetTickCount64();
  26.   //for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  27.   for i := Low(sc) to High(sc) do sc[i] := TList<LongInt>.Create;
  28.  
  29.   for i := 0 to keyNum do
  30.   begin
  31.     key := Random(2147483647);
  32.     //sc[key div blckSize].PushBack(key mod blckSize);
  33.     sc[key div 100000].Add(key mod 100000);
  34.   end;
  35.   WriteLn('Populated (ticks) - ', GetTickCount64() - start);
  36.  
  37.   Assign(outFile, 'out.txt');
  38.   Rewrite(outFile);
  39.  
  40.   //for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  41.   for i := Low(sc) to High(sc) do if sc[i].Count > 1 then sc[i].Sort();
  42.   offset := -blckSize;
  43.   uniq := 0;
  44.   for i := Low(sc) to High(sc) do
  45.   begin
  46.     Inc(offset, blckSize);
  47.     pkey := -1;
  48.     cnt := 0;
  49.     for key in sc[i] do
  50.     begin
  51.       if pkey <> key then
  52.       begin
  53.         if cnt <> 0 then
  54.         begin
  55.           WriteLn(outFile, offset + pkey, ' - ', cnt);
  56.           Inc(uniq);
  57.         end;
  58.         pkey := key;
  59.         cnt := 0;
  60.       end;
  61.       Inc(cnt);
  62.     end;
  63.     if cnt <> 0 then
  64.     begin
  65.     WriteLn(outFile, offset + pkey, ' - ', cnt);
  66.     Inc(uniq);
  67.     end;
  68.   end;
  69.  
  70.   Close(outFile);
  71.   for i := Low(sc) to High(sc) do
  72.     sc[i].Destroy;
  73.  
  74.   WriteLn('Total (ticks) - ', GetTickCount64() - start);
  75.   WriteLn('Uniq keys - ', uniq, ', out of - ', keyNum);
  76.   ReadLn;
  77. end.
  78.  
TList output -
Code: Text  [Select][+][-]
  1. Populated (ticks) - 3282
  2. Total (ticks) - 7078
  3. Uniq keys - 9976566, out of - 10000000
TVector output -
Code: Text  [Select][+][-]
  1. Populated (ticks) - 1328
  2. Total (ticks) - 4718
  3. Uniq keys - 9976566, out of - 10000000
Title: Re: Sorting and Counting
Post by: avk on July 31, 2019, 11:27:14 am
@hnb, you are welcome :)
@julkas, I believe that the slowdown is mainly due to notifications and slower sorting in Generics.Collections.TList. The last time I compared, TArrayHelper.Sort was 1.5-1.6 times slower than TOrderingArrayUtils.Sort.
Title: Re: Sorting and Counting
Post by: BrunoK on July 31, 2019, 05:13:28 pm
Let's go with my entry, if it works and I have correctly solved the initial question.
All very traditional object pascal.
Code: Pascal  [Select][+][-]
  1.  
  2.   procedure SortCountBrunoK;  { Note : requires Classes }
  3.   const
  4.     cCR = $0D;
  5.     c0 = Ord('0');
  6.     function LoadStreamToList(aMemStream: TMemoryStream; aList: TFPList): integer;
  7.     var
  8.       { Parse lines }
  9.       lPByte, lPEndByte: PByte;
  10.  
  11.       { Values extraction }
  12.       lPByteTextStart: PByte = nil;
  13.       lValueStarted: boolean = False;
  14.       lDWORD: DWORD;
  15.  
  16.       lCntRec: integer;
  17.     begin
  18.       { Prepare aList }
  19.       lCntRec := aMemStream.Size;
  20.       if lCntRec <= 0 then // Stream empty ?
  21.         exit(0);
  22.       aList.Count := lCntRec div 10; // Setup approximative size
  23.       aList.Count := 0;
  24.  
  25.       lPByte := PByte(aMemStream.memory);
  26.       lPEndByte := lPByte + aMemStream.Size;
  27.  
  28.       while lPByte <= lPEndByte do begin
  29.         if (lPByte = lPEndByte) or (lPByte^ <= cCR) then begin
  30.           if lValueStarted then begin
  31.             lDWORD := 0;
  32.             while lPByteTextStart < lPByte do begin
  33.               lDWORD := lDWORD * 10 + lPByteTextStart^ - c0;
  34.               Inc(lPByteTextStart);
  35.             end;
  36.             aList.Add(Pointer(lDWORD));
  37.             lValueStarted := False;
  38.           end;
  39.         end
  40.         else if not lValueStarted then begin
  41.           lPByteTextStart := lPByte;
  42.           lValueStarted := True;
  43.         end;
  44.         Inc(lPByte);
  45.       end;
  46.       Result := aList.Count;
  47.     end;
  48.  
  49.     function BkCompare(Item1, Item2: Pointer): integer;
  50.     begin
  51.       Result := 1;
  52.       if Item1 < Item2 then
  53.         Result := -1
  54.       else if Item1 = Item2 then
  55.         Result := 0;
  56.     end;
  57.  
  58.   var
  59.     lFile: TextFile;
  60.     lMemStream: TMemoryStream;
  61.     lNbRecs: integer = 0;
  62.     lFPList: TFPList;
  63.     lIx: integer;
  64.     lLastValue: pointer;
  65.     lListCount: integer;
  66.     lLastValueCount: integer;
  67.   begin
  68.     routineName := 'SortCountBrunoK'; // {$I %currentroutine%};
  69.     lMemStream := TMemoryStream.Create;
  70.     lMemStream.LoadFromFile(inFileName);
  71.     lFPList := TFPList.Create;
  72.     lNbRecs := LoadStreamToList(lMemStream, lFPList);
  73.     lMemStream.Free; // Not needed anymore
  74.     if lNbRecs > 0 then begin
  75.       AssignFile(lFile, outFilename);
  76.       Rewrite(lFile);
  77.       lFPList.Sort(@BkCompare);
  78.       lIx := 0;
  79.       lLastValue := lFPList[lIx];
  80.       lLastValueCount := 1;
  81.       lListCount := lFPList.Count;
  82.       repeat
  83.         Inc(lIx);
  84.         if (lIx >= lListCount) or (lFPList[lIx] <> lLastValue) then begin
  85.           Inc(unique);
  86.           WriteLn(lFile, UINTPTR(lLastValue), ' - ', lLastValueCount);
  87.           if (lIx >= lListCount) then
  88.             Break;
  89.           lLastValue := lFPList[lIx];
  90.           lLastValueCount := 1;
  91.         end
  92.         else
  93.           Inc(lLastValueCount);
  94.       until False;
  95.       CloseFile(lFile);
  96.       Total := lIx;
  97.     end;
  98.     lFPList.Free;
  99.   end;
  100.  
If correct it runs 20% faster on 64 bit relative to 32 bit.
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 06:49:03 am
I have added your code to the benchmark:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   Classes, SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, garrayutils, gvector, gmap,
  12.   WinSortCount3;
  13.  
  14. type
  15.   TIntPair = TPair<LongInt, LongInt>;
  16.   TProcedureArray = array of procedure;
  17.  
  18.   function ComparePairs(constref L, R: TIntPair): LongInt;
  19.   begin
  20.     if L.Key < R.Key then
  21.       Result := -1
  22.     else if L.Key = R.Key then
  23.       Result := 0
  24.     else
  25.       Result := 1;
  26.   end;
  27.  
  28. var
  29.   Total, Unique, repeatCount, randomrange: Integer;
  30.   Start: TDateTime;
  31.   inFilename: String = 'data.txt';
  32.   outFilename: String = 'sorted.txt';
  33.   routineName: String;
  34.   procedures: TProcedureArray;
  35.   proc: procedure;
  36.  
  37.   procedure GenerateData(randomRange: Integer=8; repeatMillionsCount: Integer=2);
  38.   var
  39.     InFile: Text;
  40.     I: LongInt;
  41.   begin
  42.     Assign(InFile, inFilename);
  43.     Rewrite(InFile);
  44.     for I := 1 to repeatMillionsCount * 1000000 do
  45.       WriteLn(InFile, 1500000000 + Random(randomRange * 100000));
  46.     Close(InFile);
  47.   end;
  48.  
  49.   procedure SortCountAkira;
  50.   var
  51.     I: LongInt;
  52.     InOut: Text;
  53.     Map: TDictionary<LongInt, LongInt>;
  54.     Pair: TIntPair;
  55.     Pairs: TArray<TIntPair>;
  56.   begin
  57.     routineName := {$I %currentroutine%};
  58.     Map := TDictionary<LongInt, LongInt>.Create();
  59.     //Map.Capacity := 10000000;
  60.     Assign(InOut, inFilename);
  61.     Reset(InOut);
  62.     while not EOF(InOut) do begin
  63.       ReadLn(InOut, I);
  64.       Inc(Total);
  65.       if not Map.ContainsKey(I) then
  66.         begin
  67.           Map.Add(I, 1);
  68.           Inc(Unique);
  69.         end
  70.       else
  71.         Map[I] := Map[I] + 1;
  72.     end;
  73.     Close(InOut);
  74.     Pairs := Map.ToArray();
  75.     Map.Free();
  76.     TArrayHelper<TIntPair>.Sort(
  77.       Pairs,
  78.       TComparer<TIntPair>.Construct(ComparePairs)
  79.     );
  80.     Assign(InOut, outFilename);
  81.     Rewrite(InOut);
  82.     for Pair in Pairs do with Pair do
  83.       WriteLn(InOut, Key, ' - ', Value);
  84.     Close(InOut);
  85.   end;
  86.  
  87.   procedure SortCountHoward;
  88.   var
  89.     arr: array of Integer;
  90.     textf: TextFile;
  91.     min: Integer = High(Integer);
  92.     max: Integer = -1;
  93.     i: Integer;
  94.   begin
  95.     routineName := {$I %currentroutine%};
  96.     AssignFile(textf, inFilename);
  97.     Reset(textf);
  98.     while not EOF(textf) do
  99.       begin
  100.         ReadLn(textf, i);
  101.         Inc(Total);
  102.         if i < min then
  103.           min := i;
  104.         if i > max then
  105.           max := i;
  106.       end;
  107.     SetLength(arr, max-min+1);
  108.  
  109.     Reset(textf);
  110.     while not EOF(textf) do
  111.       begin
  112.         ReadLn(textf, i);
  113.         Dec(i, min);
  114.         Inc(arr[i]);
  115.       end;
  116.     CloseFile(textf);
  117.  
  118.     AssignFile(textf, outFilename);
  119.     Rewrite(textf);
  120.     for i := Low(arr) to High(arr) do
  121.       case (arr[i] > 0) of
  122.         True:
  123.           begin
  124.             WriteLn(textf, i+min, ' - ', arr[i]);
  125.             Inc(Unique);
  126.           end;
  127.       end;
  128.     CloseFile(textf);
  129.     SetLength(arr, 0);
  130.   end;
  131.  
  132.   procedure SortCountAvk1;
  133.   type
  134.     TCounter  = TGHashMultiSetLP<Integer>;
  135.     TCountRef = TGAutoRef<TCounter>;
  136.     TEntry    = TCounter.TEntry;
  137.  
  138.     function EntryCmp(constref L, R: TEntry): SizeInt;
  139.     begin
  140.       if L.Key > R.Key then
  141.         Result := 1
  142.       else
  143.         if L.Key < R.Key then
  144.           Result := -1
  145.         else
  146.           Result := 0;
  147.     end;
  148.  
  149.   var
  150.     CountRef: TCountRef;
  151.     InOut: Text;
  152.     Counter: TCounter;
  153.     e: TEntry;
  154.     I: Integer;
  155.   begin
  156.     routineName := {$I %currentroutine%};
  157.     Counter := CountRef;
  158.     //Counter.LoadFactor := 0.7;
  159.     Assign(InOut, inFilename);
  160.     Reset(InOut);
  161.     while not EOF(InOut) do
  162.       begin
  163.         ReadLn(InOut, I);
  164.         Counter.Add(I);
  165.       end;
  166.     Close(InOut);
  167.     Total := Counter.Count;
  168.     Unique := Counter.EntryCount;
  169.     if Counter.NonEmpty then
  170.       begin
  171.         Assign(InOut, outFilename);
  172.         Rewrite(InOut);
  173.         for e in Counter.Entries.Sorted(EntryCmp) do
  174.           with e do
  175.             WriteLn(InOut, Key, ' - ', Count);
  176.         Close(InOut);
  177.       end;
  178.   end;
  179.  
  180.   procedure SortCountAvk2;
  181.   var
  182.     List: array of Integer;
  183.     InOut: Text;
  184.     I, J, Count, DupCount: Integer;
  185.   begin
  186.     routineName := {$I %currentroutine%};
  187.     Assign(InOut, inFilename);
  188.     Reset(InOut);
  189.     SetLength(List, 262144);
  190.     I := 0;
  191.     while not EOF(InOut) do
  192.       begin
  193.         ReadLn(InOut, J);
  194.         Inc(Total);
  195.         if Length(List) = I then
  196.           SetLength(List, I * 2);
  197.         List[I] := J;
  198.         Inc(I);
  199.       end;
  200.     Close(InOut);
  201.     SetLength(List, I);
  202.     if List = nil then
  203.       exit;
  204.     TGOrdinalArrayHelper<Integer>.Sort(List);
  205.     Count := I;
  206.     DupCount := 0;
  207.     I := 0;
  208.     Assign(InOut, outFilename);
  209.     Rewrite(InOut);
  210.     repeat
  211.       J := List[I];
  212.       while (I < Count) and (List[I] = J) do
  213.         begin
  214.           Inc(DupCount);
  215.           Inc(I);
  216.         end;
  217.       WriteLn(InOut, J, ' - ', DupCount);
  218.       Inc(Unique);
  219.       DupCount := 0;
  220.     until I = Count;
  221.     Close(InOut);
  222.   end;
  223.  
  224.   procedure SortCountJulkas1;
  225.   type
  226.     TIntLess = TLess<LongInt>;
  227.     TIntVect = TVector<LongInt>;
  228.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  229.   const
  230.     bsz = 1 shl 17;
  231.   var
  232.     sc: array[0..2147483647 shr 17] of TIntVect;
  233.     i: LongInt;
  234.     pkey, key, cnt: LongInt;
  235.     offset: LongInt;
  236.     InOut: Text;
  237.   begin
  238.     routineName := {$I %currentroutine%};
  239.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  240.  
  241.     Assign(InOut, inFilename);
  242.     Reset(InOut);
  243.     while not EOF(InOut) do
  244.       begin
  245.         ReadLn(InOut, key);
  246.         Inc(Total);
  247.         sc[key shr 17].PushBack(key and $1FFFF);
  248.       end;
  249.     Close(InOut);
  250.  
  251.     Assign(InOut, outFilename);
  252.     Rewrite(InOut);
  253.  
  254.     offset := -bsz;
  255.     for i := Low(sc) to High(sc) do
  256.     begin
  257.       Inc(offset, bsz);
  258.       pkey := -1;
  259.       cnt := 0;
  260.       if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  261.       for key in sc[i] do
  262.       begin
  263.         if pkey <> key then
  264.         begin
  265.           if cnt <> 0 then
  266.           begin
  267.             WriteLn(InOut, offset + pkey, ' - ', cnt);
  268.             Inc(Unique);
  269.           end;
  270.           pkey := key;
  271.           cnt := 0;
  272.         end;
  273.         Inc(cnt);
  274.       end;
  275.       if cnt <> 0 then
  276.       begin
  277.         WriteLn(InOut, offset + pkey, ' - ', cnt);
  278.         Inc(Unique);
  279.       end;
  280.     end;
  281.  
  282.     Close(InOut);
  283.     for i := Low(sc) to High(sc) do sc[i].Destroy;
  284.   end;
  285.  
  286.   procedure SortCountJulkas2;
  287.   type
  288.     TIntLess = TLess<LongInt>;
  289.     TIntVect = TVector<LongInt>;
  290.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  291.   var
  292.     sc: array[0..21474] of TIntVect;
  293.     i: LongInt;
  294.     pkey, key, cnt: LongInt;
  295.     offset: LongInt;
  296.     InOut: Text;
  297.   begin
  298.     routineName := {$I %currentroutine%};
  299.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  300.  
  301.     Assign(InOut, inFilename);
  302.     Reset(InOut);
  303.     while not EOF(InOut) do
  304.       begin
  305.         ReadLn(InOut, key);
  306.         Inc(Total);
  307.         sc[key div 100000].PushBack(key mod 100000);
  308.       end;
  309.     Close(InOut);
  310.  
  311.     Assign(InOut, outFilename);
  312.     Rewrite(InOut);
  313.  
  314.     for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  315.  
  316.     offset := -100000;
  317.     for i := Low(sc) to High(sc) do
  318.     begin
  319.       Inc(offset, 100000);
  320.       pkey := -1;
  321.       cnt := 0;
  322.       for key in sc[i] do
  323.       begin
  324.         if pkey <> key then
  325.         begin
  326.           if cnt <> 0 then
  327.           begin
  328.             WriteLn(InOut, offset + pkey, ' - ', cnt);
  329.             Inc(Unique);
  330.           end;
  331.           pkey := key;
  332.           cnt := 0;
  333.         end;
  334.         Inc(cnt);
  335.       end;
  336.       if cnt <> 0 then
  337.       begin
  338.         WriteLn(InOut, offset + pkey, ' - ', cnt);
  339.         Inc(Unique);
  340.       end;
  341.     end;
  342.  
  343.     Close(InOut);
  344.     for i := Low(sc) to High(sc) do
  345.       sc[i].Destroy;
  346.   end;
  347.  
  348.   procedure SortCount440bx;
  349.   begin
  350.     routineName := {$I %currentroutine%};
  351.     WinSortCount3.DataCount := 0;
  352.     WinSortCount3.Unique := 0;
  353.     WinSortCount3.InFileName := inFilename;
  354.     WinSortCount3.OutFileName := outFilename;
  355.     WinSortCount3.SortCount;
  356.     Total := WinSortCount3.DataCount;
  357.     Unique := WinSortCount3.Unique;
  358.   end;
  359.  
  360.   procedure SortCountBrunoK;  { Note : requires Classes }
  361.   const
  362.     cCR = $0D;
  363.     c0 = Ord('0');
  364.     function LoadStreamToList(aMemStream: TMemoryStream; aList: TFPList): integer;
  365.     var
  366.       { Parse lines }
  367.       lPByte, lPEndByte: PByte;
  368.  
  369.       { Values extraction }
  370.       lPByteTextStart: PByte = nil;
  371.       lValueStarted: boolean = False;
  372.       lDWORD: DWORD;
  373.  
  374.       lCntRec: integer;
  375.     begin
  376.       { Prepare aList }
  377.       lCntRec := aMemStream.Size;
  378.       if lCntRec <= 0 then // Stream empty ?
  379.         exit(0);
  380.       aList.Count := lCntRec div 10; // Setup approximative size
  381.       aList.Count := 0;
  382.  
  383.       lPByte := PByte(aMemStream.memory);
  384.       lPEndByte := lPByte + aMemStream.Size;
  385.  
  386.       while lPByte <= lPEndByte do begin
  387.         if (lPByte = lPEndByte) or (lPByte^ <= cCR) then begin
  388.           if lValueStarted then begin
  389.             lDWORD := 0;
  390.             while lPByteTextStart < lPByte do begin
  391.               lDWORD := lDWORD * 10 + lPByteTextStart^ - c0;
  392.               Inc(lPByteTextStart);
  393.             end;
  394.             aList.Add(Pointer(lDWORD));
  395.             lValueStarted := False;
  396.           end;
  397.         end
  398.         else if not lValueStarted then begin
  399.           lPByteTextStart := lPByte;
  400.           lValueStarted := True;
  401.         end;
  402.         Inc(lPByte);
  403.       end;
  404.       Result := aList.Count;
  405.     end;
  406.  
  407.     function BkCompare(Item1, Item2: Pointer): integer;
  408.     begin
  409.       Result := 1;
  410.       if Item1 < Item2 then
  411.         Result := -1
  412.       else if Item1 = Item2 then
  413.         Result := 0;
  414.     end;
  415.  
  416.   var
  417.     lFile: TextFile;
  418.     lMemStream: TMemoryStream;
  419.     lNbRecs: integer = 0;
  420.     lFPList: TFPList;
  421.     lIx: integer;
  422.     lLastValue: pointer;
  423.     lListCount: integer;
  424.     lLastValueCount: integer;
  425.   begin
  426.     routineName := 'SortCountBrunoK'; //{$I %currentroutine%};
  427.     lMemStream := TMemoryStream.Create;
  428.     lMemStream.LoadFromFile(inFileName);
  429.     lFPList := TFPList.Create;
  430.     lNbRecs := LoadStreamToList(lMemStream, lFPList);
  431.     lMemStream.Free; // Not needed anymore
  432.     if lNbRecs > 0 then begin
  433.       AssignFile(lFile, outFilename);
  434.       Rewrite(lFile);
  435.       lFPList.Sort(@BkCompare);
  436.       lIx := 0;
  437.       lLastValue := lFPList[lIx];
  438.       lLastValueCount := 1;
  439.       lListCount := lFPList.Count;
  440.       repeat
  441.         Inc(lIx);
  442.         if (lIx >= lListCount) or (lFPList[lIx] <> lLastValue) then begin
  443.           Inc(unique);
  444.           WriteLn(lFile, UINTPTR(lLastValue), ' - ', lLastValueCount);
  445.           if (lIx >= lListCount) then
  446.             Break;
  447.           lLastValue := lFPList[lIx];
  448.           lLastValueCount := 1;
  449.         end
  450.         else
  451.           Inc(lLastValueCount);
  452.       until False;
  453.       CloseFile(lFile);
  454.       Total := lIx;
  455.     end;
  456.     lFPList.Free;
  457.   end;
  458.  
  459.   procedure Run(aProc: TProcedure);
  460.   begin
  461.     Total := 0;
  462.     Unique := 0;
  463.     Start := Now;
  464.     try
  465.       aProc();
  466.       WriteLn(Copy(routineName, 10, 20):7,'''s time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4,#9'#unique: ',Unique,' #total: ',Total);
  467.     except
  468.       on e: Exception do
  469.         WriteLn('crashes with message "', e.Message, '"');
  470.     end;
  471.   end;
  472.  
  473. begin
  474.   Randomize;
  475.  
  476.   procedures := TProcedureArray.Create(
  477.                    SortCountJulkas1, SortCountJulkas2, SortCountAkira, SortCountHoward, SortCountAvk1,
  478.                    SortCountAvk2, SortCount440bx, SortCountBrunoK);
  479.  
  480.   for randomrange := 1 to 10 do
  481.     begin
  482.       GenerateData(randomrange, 10);
  483.       WriteLn(#10'RandomRange = ',randomrange);
  484.       for proc in procedures do
  485.         Run(proc);
  486.     end;
  487.  
  488.   for repeatCount := 1 to 10 do
  489.     begin
  490.       GenerateData(8, 2*repeatCount);
  491.       WriteLn(#10'repeatMillionsCount = ', 2*repeatCount);
  492.       for proc in procedures do
  493.         Run(proc);
  494.     end;
  495. end.
  496.  

At least on my machine, the performance of your code is second only to 440bx one.
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 08:23:16 am
At least on my machine, the performance of your code is second only to 440bx one.
My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)

In the real world the different sequence would not be acceptable.

ETA:

Bruno's implementation can be made a smidgen faster by changing the compare function to test for equality first (since there are more duplicate values than unique), that would lower the number of comparisons required to determine relative magnitudes.
Title: Re: Sorting and Counting
Post by: MathMan on August 01, 2019, 08:58:58 am
Quote
ETA:

Bruno's implementation can be made a smidgen faster by changing the compare function to test for equality first (since there are more duplicate values than unique), that would lower the number of comparisons required to determine relative magnitudes.

I think that the compare function can be speeded up a lot by "linearizing" the code to eliminate data dependend branches like

Code: [Select]
  Result := 1 - integer( a=b ) - 2*integer( a<b );

At least on my system (Core i6700k) the latter runs a roughly tripple speed of the former.

MathMan
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 09:29:21 am
Code: [Select]
  Result := 1 - integer( a=b ) - 2*integer( a<b );
At least on my system (Core i6700k) the latter runs a roughly tripple speed of the former.

MathMan
I am surprised that is faster because in order to calculate the result, the arithmetic expression, unlike a Boolean expression, must be fully evaluated which means in all cases, two (2) compares instead of possibly just one, will be necessary.
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 09:46:58 am
My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)
And this can not be fixed?

@BrunoK, you really added fuel to the fire. :)
I wondered what would happen if I attached a similar trick to SortCountAvk2:
Code: Pascal  [Select][+][-]
  1. program OccurrenceCounter;
  2.  
  3. {$mode delphi}
  4. {$ImplicitExceptions Off}
  5. {$MODESWITCH NESTEDPROCVARS}
  6.  
  7. uses
  8.   Classes, SysUtils, DateUtils,
  9.   Generics.Defaults, Generics.Collections,
  10.   LGUtils, LGHashMultiSet, LGArrayHelpers,
  11.   gutil, garrayutils, gvector, gmap,
  12.   WinSortCount3;
  13.  
  14. type
  15.   TIntPair = TPair<LongInt, LongInt>;
  16.   TProcedureArray = array of procedure;
  17.  
  18.   function ComparePairs(constref L, R: TIntPair): LongInt;
  19.   begin
  20.     if L.Key < R.Key then
  21.       Result := -1
  22.     else if L.Key = R.Key then
  23.       Result := 0
  24.     else
  25.       Result := 1;
  26.   end;
  27.  
  28. var
  29.   Total, Unique, repeatCount, randomrange: Integer;
  30.   Start: TDateTime;
  31.   inFilename: String = 'data.txt';
  32.   outFilename: String = 'sorted.txt';
  33.   routineName: String;
  34.   procedures: TProcedureArray;
  35.   proc: procedure;
  36.  
  37.   procedure GenerateData(randomRange: Integer=8; repeatMillionsCount: Integer=2);
  38.   var
  39.     InFile: Text;
  40.     I: LongInt;
  41.   begin
  42.     Assign(InFile, inFilename);
  43.     Rewrite(InFile);
  44.     for I := 1 to repeatMillionsCount * 1000000 do
  45.       WriteLn(InFile, 1500000000 + Random(randomRange * 100000));
  46.     Close(InFile);
  47.   end;
  48.  
  49.   procedure SortCountAkira;
  50.   var
  51.     I: LongInt;
  52.     InOut: Text;
  53.     Map: TDictionary<LongInt, LongInt>;
  54.     Pair: TIntPair;
  55.     Pairs: TArray<TIntPair>;
  56.   begin
  57.     routineName := {$I %currentroutine%};
  58.     Map := TDictionary<LongInt, LongInt>.Create();
  59.     //Map.Capacity := 10000000;
  60.     Assign(InOut, inFilename);
  61.     Reset(InOut);
  62.     while not EOF(InOut) do begin
  63.       ReadLn(InOut, I);
  64.       Inc(Total);
  65.       if not Map.ContainsKey(I) then
  66.         begin
  67.           Map.Add(I, 1);
  68.           Inc(Unique);
  69.         end
  70.       else
  71.         Map[I] := Map[I] + 1;
  72.     end;
  73.     Close(InOut);
  74.     Pairs := Map.ToArray();
  75.     Map.Free();
  76.     TArrayHelper<TIntPair>.Sort(
  77.       Pairs,
  78.       TComparer<TIntPair>.Construct(ComparePairs)
  79.     );
  80.     Assign(InOut, outFilename);
  81.     Rewrite(InOut);
  82.     for Pair in Pairs do with Pair do
  83.       WriteLn(InOut, Key, ' - ', Value);
  84.     Close(InOut);
  85.   end;
  86.  
  87.   procedure SortCountHoward;
  88.   var
  89.     arr: array of Integer;
  90.     textf: TextFile;
  91.     min: Integer = High(Integer);
  92.     max: Integer = -1;
  93.     i: Integer;
  94.   begin
  95.     routineName := {$I %currentroutine%};
  96.     AssignFile(textf, inFilename);
  97.     Reset(textf);
  98.     while not EOF(textf) do
  99.       begin
  100.         ReadLn(textf, i);
  101.         Inc(Total);
  102.         if i < min then
  103.           min := i;
  104.         if i > max then
  105.           max := i;
  106.       end;
  107.     SetLength(arr, max-min+1);
  108.  
  109.     Reset(textf);
  110.     while not EOF(textf) do
  111.       begin
  112.         ReadLn(textf, i);
  113.         Dec(i, min);
  114.         Inc(arr[i]);
  115.       end;
  116.     CloseFile(textf);
  117.  
  118.     AssignFile(textf, outFilename);
  119.     Rewrite(textf);
  120.     for i := Low(arr) to High(arr) do
  121.       case (arr[i] > 0) of
  122.         True:
  123.           begin
  124.             WriteLn(textf, i+min, ' - ', arr[i]);
  125.             Inc(Unique);
  126.           end;
  127.       end;
  128.     CloseFile(textf);
  129.     SetLength(arr, 0);
  130.   end;
  131.  
  132.   procedure SortCountAvk1;
  133.   type
  134.     TCounter  = TGHashMultiSetLP<Integer>;
  135.     TCountRef = TGAutoRef<TCounter>;
  136.     TEntry    = TCounter.TEntry;
  137.  
  138.     function EntryCmp(constref L, R: TEntry): SizeInt;
  139.     begin
  140.       if L.Key > R.Key then
  141.         Result := 1
  142.       else
  143.         if L.Key < R.Key then
  144.           Result := -1
  145.         else
  146.           Result := 0;
  147.     end;
  148.  
  149.   var
  150.     CountRef: TCountRef;
  151.     InOut: Text;
  152.     Counter: TCounter;
  153.     e: TEntry;
  154.     I: Integer;
  155.   begin
  156.     routineName := {$I %currentroutine%};
  157.     Counter := CountRef;
  158.     //Counter.LoadFactor := 0.7;
  159.     Assign(InOut, inFilename);
  160.     Reset(InOut);
  161.     while not EOF(InOut) do
  162.       begin
  163.         ReadLn(InOut, I);
  164.         Counter.Add(I);
  165.       end;
  166.     Close(InOut);
  167.     Total := Counter.Count;
  168.     Unique := Counter.EntryCount;
  169.     if Counter.NonEmpty then
  170.       begin
  171.         Assign(InOut, outFilename);
  172.         Rewrite(InOut);
  173.         for e in Counter.Entries.Sorted(EntryCmp) do
  174.           with e do
  175.             WriteLn(InOut, Key, ' - ', Count);
  176.         Close(InOut);
  177.       end;
  178.   end;
  179.  
  180.   procedure SortCountAvk2;
  181.   type
  182.     TIntArray = array of Integer;
  183.  
  184.     function LoadData: TIntArray;
  185.     var
  186.       PCurr, PLast: PByte;
  187.       DataSize, CurrValue, I: Integer;
  188.       DoReading: Boolean = False;
  189.     begin
  190.       Result := nil;
  191.       I := 0;
  192.       with TMemoryStream.Create do
  193.         try
  194.           LoadFromFile(inFileName);
  195.           DataSize := Size;
  196.           if DataSize <= 0 then
  197.             exit;
  198.           SetLength(Result, DataSize div 10);
  199.           PCurr := Memory;
  200.           PLast := PCurr + Size;
  201.           CurrValue := 0;
  202.           repeat
  203.             if PCurr^ > $0D then
  204.               begin
  205.                 DoReading := True;
  206.                 CurrValue := CurrValue * 10 + PCurr^ - Ord('0');
  207.               end
  208.             else
  209.               if DoReading then
  210.                 begin
  211.                   if Length(Result) = I then
  212.                     SetLength(Result, I * 2);
  213.                   Result[I] := CurrValue;
  214.                   Inc(I);
  215.                   CurrValue := 0;
  216.                   DoReading := False;
  217.                 end;
  218.             Inc(PCurr);
  219.           until PCurr > PLast;
  220.         finally
  221.           Free;
  222.         end;
  223.       SetLength(Result, I);
  224.     end;
  225.   var
  226.     List: TIntArray;
  227.     InOut: Text;
  228.     I, J, Count, DupCount: Integer;
  229.   begin
  230.     routineName := {$I %currentroutine%};
  231.     List := LoadData;
  232.     if List = nil then
  233.       exit;
  234.     TGOrdinalArrayHelper<Integer>.Sort(List);
  235.     Total := Length(List);
  236.     Count := Total;
  237.     DupCount := 0;
  238.     I := 0;
  239.     Assign(InOut, outFilename+'1');
  240.     Rewrite(InOut);
  241.     repeat
  242.       J := List[I];
  243.       while (I < Count) and (List[I] = J) do
  244.         begin
  245.           Inc(DupCount);
  246.           Inc(I);
  247.         end;
  248.       WriteLn(InOut, J, ' - ', DupCount);
  249.       Inc(Unique);
  250.       DupCount := 0;
  251.     until I = Count;
  252.     Close(InOut);
  253.   end;
  254.  
  255.   procedure SortCountJulkas1;
  256.   type
  257.     TIntLess = TLess<LongInt>;
  258.     TIntVect = TVector<LongInt>;
  259.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  260.   const
  261.     bsz = 1 shl 17;
  262.   var
  263.     sc: array[0..2147483647 shr 17] of TIntVect;
  264.     i: LongInt;
  265.     pkey, key, cnt: LongInt;
  266.     offset: LongInt;
  267.     InOut: Text;
  268.   begin
  269.     routineName := {$I %currentroutine%};
  270.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  271.  
  272.     Assign(InOut, inFilename);
  273.     Reset(InOut);
  274.     while not EOF(InOut) do
  275.       begin
  276.         ReadLn(InOut, key);
  277.         Inc(Total);
  278.         sc[key shr 17].PushBack(key and $1FFFF);
  279.       end;
  280.     Close(InOut);
  281.  
  282.     Assign(InOut, outFilename);
  283.     Rewrite(InOut);
  284.  
  285.     offset := -bsz;
  286.     for i := Low(sc) to High(sc) do
  287.     begin
  288.       Inc(offset, bsz);
  289.       pkey := -1;
  290.       cnt := 0;
  291.       if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  292.       for key in sc[i] do
  293.       begin
  294.         if pkey <> key then
  295.         begin
  296.           if cnt <> 0 then
  297.           begin
  298.             WriteLn(InOut, offset + pkey, ' - ', cnt);
  299.             Inc(Unique);
  300.           end;
  301.           pkey := key;
  302.           cnt := 0;
  303.         end;
  304.         Inc(cnt);
  305.       end;
  306.       if cnt <> 0 then
  307.       begin
  308.         WriteLn(InOut, offset + pkey, ' - ', cnt);
  309.         Inc(Unique);
  310.       end;
  311.     end;
  312.  
  313.     Close(InOut);
  314.     for i := Low(sc) to High(sc) do sc[i].Destroy;
  315.   end;
  316.  
  317.   procedure SortCountJulkas2;
  318.   type
  319.     TIntLess = TLess<LongInt>;
  320.     TIntVect = TVector<LongInt>;
  321.     TOrd = TOrderingArrayUtils<TIntVect, LongInt, TIntLess>;
  322.   var
  323.     sc: array[0..21474] of TIntVect;
  324.     i: LongInt;
  325.     pkey, key, cnt: LongInt;
  326.     offset: LongInt;
  327.     InOut: Text;
  328.   begin
  329.     routineName := {$I %currentroutine%};
  330.     for i := Low(sc) to High(sc) do sc[i] := TIntVect.Create;
  331.  
  332.     Assign(InOut, inFilename);
  333.     Reset(InOut);
  334.     while not EOF(InOut) do
  335.       begin
  336.         ReadLn(InOut, key);
  337.         Inc(Total);
  338.         sc[key div 100000].PushBack(key mod 100000);
  339.       end;
  340.     Close(InOut);
  341.  
  342.     Assign(InOut, outFilename);
  343.     Rewrite(InOut);
  344.  
  345.     for i := Low(sc) to High(sc) do if sc[i].Size > 1 then TOrd.Sort(sc[i], sc[i].Size);
  346.  
  347.     offset := -100000;
  348.     for i := Low(sc) to High(sc) do
  349.     begin
  350.       Inc(offset, 100000);
  351.       pkey := -1;
  352.       cnt := 0;
  353.       for key in sc[i] do
  354.       begin
  355.         if pkey <> key then
  356.         begin
  357.           if cnt <> 0 then
  358.           begin
  359.             WriteLn(InOut, offset + pkey, ' - ', cnt);
  360.             Inc(Unique);
  361.           end;
  362.           pkey := key;
  363.           cnt := 0;
  364.         end;
  365.         Inc(cnt);
  366.       end;
  367.       if cnt <> 0 then
  368.       begin
  369.         WriteLn(InOut, offset + pkey, ' - ', cnt);
  370.         Inc(Unique);
  371.       end;
  372.     end;
  373.  
  374.     Close(InOut);
  375.     for i := Low(sc) to High(sc) do
  376.       sc[i].Destroy;
  377.   end;
  378.  
  379.   procedure SortCount440bx;
  380.   begin
  381.     routineName := {$I %currentroutine%};
  382.     WinSortCount3.DataCount := 0;
  383.     WinSortCount3.Unique := 0;
  384.     WinSortCount3.InFileName := inFilename;
  385.     WinSortCount3.OutFileName := outFilename;
  386.     WinSortCount3.SortCount;
  387.     Total := WinSortCount3.DataCount;
  388.     Unique := WinSortCount3.Unique;
  389.   end;
  390.  
  391.   procedure SortCountBrunoK;  { Note : requires Classes }
  392.   const
  393.     cCR = $0D;
  394.     c0 = Ord('0');
  395.     function LoadStreamToList(aMemStream: TMemoryStream; aList: TFPList): integer;
  396.     var
  397.       { Parse lines }
  398.       lPByte, lPEndByte: PByte;
  399.  
  400.       { Values extraction }
  401.       lPByteTextStart: PByte = nil;
  402.       lValueStarted: boolean = False;
  403.       lDWORD: DWORD;
  404.  
  405.       lCntRec: integer;
  406.     begin
  407.       { Prepare aList }
  408.       lCntRec := aMemStream.Size;
  409.       if lCntRec <= 0 then // Stream empty ?
  410.         exit(0);
  411.       aList.Count := lCntRec div 10; // Setup approximative size
  412.       aList.Count := 0;
  413.  
  414.       lPByte := PByte(aMemStream.memory);
  415.       lPEndByte := lPByte + aMemStream.Size;
  416.  
  417.       while lPByte <= lPEndByte do begin
  418.         if (lPByte = lPEndByte) or (lPByte^ <= cCR) then begin
  419.           if lValueStarted then begin
  420.             lDWORD := 0;
  421.             while lPByteTextStart < lPByte do begin
  422.               lDWORD := lDWORD * 10 + lPByteTextStart^ - c0;
  423.               Inc(lPByteTextStart);
  424.             end;
  425.             aList.Add(Pointer(lDWORD));
  426.             lValueStarted := False;
  427.           end;
  428.         end
  429.         else if not lValueStarted then begin
  430.           lPByteTextStart := lPByte;
  431.           lValueStarted := True;
  432.         end;
  433.         Inc(lPByte);
  434.       end;
  435.       Result := aList.Count;
  436.     end;
  437.  
  438.     function BkCompare(Item1, Item2: Pointer): integer;
  439.     begin
  440.       Result := 1;
  441.       if Item1 < Item2 then
  442.         Result := -1
  443.       else if Item1 = Item2 then
  444.         Result := 0;
  445.     end;
  446.  
  447.   var
  448.     lFile: TextFile;
  449.     lMemStream: TMemoryStream;
  450.     lNbRecs: integer = 0;
  451.     lFPList: TFPList;
  452.     lIx: integer;
  453.     lLastValue: pointer;
  454.     lListCount: integer;
  455.     lLastValueCount: integer;
  456.   begin
  457.     routineName := 'SortCountBrunoK'; //{$I %currentroutine%};
  458.     lMemStream := TMemoryStream.Create;
  459.     lMemStream.LoadFromFile(inFileName);
  460.     lFPList := TFPList.Create;
  461.     lNbRecs := LoadStreamToList(lMemStream, lFPList);
  462.     lMemStream.Free; // Not needed anymore
  463.     if lNbRecs > 0 then begin
  464.       AssignFile(lFile, outFilename);
  465.       Rewrite(lFile);
  466.       lFPList.Sort(@BkCompare);
  467.       lIx := 0;
  468.       lLastValue := lFPList[lIx];
  469.       lLastValueCount := 1;
  470.       lListCount := lFPList.Count;
  471.       repeat
  472.         Inc(lIx);
  473.         if (lIx >= lListCount) or (lFPList[lIx] <> lLastValue) then begin
  474.           Inc(unique);
  475.           WriteLn(lFile, UINTPTR(lLastValue), ' - ', lLastValueCount);
  476.           if (lIx >= lListCount) then
  477.             Break;
  478.           lLastValue := lFPList[lIx];
  479.           lLastValueCount := 1;
  480.         end
  481.         else
  482.           Inc(lLastValueCount);
  483.       until False;
  484.       CloseFile(lFile);
  485.       Total := lIx;
  486.     end;
  487.     lFPList.Free;
  488.   end;
  489.  
  490.   procedure Run(aProc: TProcedure);
  491.   begin
  492.     Total := 0;
  493.     Unique := 0;
  494.     Start := Now;
  495.     try
  496.       aProc();
  497.       WriteLn(Copy(routineName, 10, 20):7,'''s time: ', MilliSecondsBetween(Now(), Start) / 1000.0 : 0 : 4,#9'#unique: ',Unique,' #total: ',Total);
  498.     except
  499.       on e: Exception do
  500.         WriteLn('crashes with message "', e.Message, '"');
  501.     end;
  502.   end;
  503.  
  504. begin
  505.   Randomize;
  506.  
  507.   procedures := TProcedureArray.Create(
  508.                    SortCountJulkas1, SortCountJulkas2, SortCountAkira, SortCountHoward, SortCountAvk1,
  509.                    SortCountAvk2, SortCount440bx, SortCountBrunoK);
  510.  
  511.   for randomrange := 1 to 10 do
  512.     begin
  513.       GenerateData(randomrange, 10);
  514.       WriteLn(#10'RandomRange = ',randomrange);
  515.       for proc in procedures do
  516.         Run(proc);
  517.     end;
  518.  
  519.   for repeatCount := 1 to 10 do
  520.     begin
  521.       GenerateData(8, 2*repeatCount);
  522.       WriteLn(#10'repeatMillionsCount = ', 2*repeatCount);
  523.       for proc in procedures do
  524.         Run(proc);
  525.     end;
  526. end.
  527.  

Results in attachment.
Title: Re: Sorting and Counting
Post by: MathMan on August 01, 2019, 10:00:58 am
Code: [Select]
  Result := 1 - integer( a=b ) - 2*integer( a<b );
At least on my system (Core i6700k) the latter runs a roughly tripple speed of the former.

MathMan
I am surprised that is faster because in order to calculate the result, the arithmetic expression, unlike a Boolean expression, must be fully evaluated which means in all cases, two (2) compares instead of possibly just one, will be necessary.

I did measure with 1 mio random pair entries (which might not be 100% equivalent to the calling patterns of the sort function) and the executon time were 16 mio clock cycles vs. 5.5 mio clock cycles on my system. The branch misprediction penalty on modern CPUs is quite high - in the range of 20-50 clock cycles - and for random entries you'll have around 50% misprediction.

MathMan
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 10:12:46 am
And 32-bit results:
Code: Text  [Select][+][-]
  1.  
  2. RandomRange = 1
  3. Julkas1's time: 3.9780  #unique: 100000 #total: 10000000
  4. Julkas2's time: 3.9630  #unique: 100000 #total: 10000000
  5.   Akira's time: 3.4780  #unique: 100000 #total: 10000000
  6.  Howard's time: 5.1170  #unique: 100000 #total: 10000000
  7.    Avk1's time: 3.1360  #unique: 100000 #total: 10000000
  8.    Avk2's time: 0.4990  #unique: 100000 #total: 10000000
  9.   440bx's time: 2.2930  #unique: 100000 #total: 10000000
  10.  BrunoK's time: 2.2310  #unique: 100000 #total: 10000000
  11.  
  12. RandomRange = 2
  13. Julkas1's time: 3.9460  #unique: 200000 #total: 10000000
  14. Julkas2's time: 3.9630  #unique: 200000 #total: 10000000
  15.   Akira's time: 3.7440  #unique: 200000 #total: 10000000
  16.  Howard's time: 5.1630  #unique: 200000 #total: 10000000
  17.    Avk1's time: 3.3860  #unique: 200000 #total: 10000000
  18.    Avk2's time: 0.5140  #unique: 200000 #total: 10000000
  19.   440bx's time: 2.4340  #unique: 200000 #total: 10000000
  20.  BrunoK's time: 2.2930  #unique: 200000 #total: 10000000
  21.  
  22. RandomRange = 3
  23. Julkas1's time: 3.9780  #unique: 300000 #total: 10000000
  24. Julkas2's time: 3.9780  #unique: 300000 #total: 10000000
  25.   Akira's time: 3.9000  #unique: 300000 #total: 10000000
  26.  Howard's time: 5.2260  #unique: 300000 #total: 10000000
  27.    Avk1's time: 3.5560  #unique: 300000 #total: 10000000
  28.    Avk2's time: 0.5460  #unique: 300000 #total: 10000000
  29.   440bx's time: 2.5430  #unique: 300000 #total: 10000000
  30.  BrunoK's time: 2.3710  #unique: 300000 #total: 10000000
  31.  
  32. RandomRange = 4
  33. Julkas1's time: 4.0090  #unique: 400000 #total: 10000000
  34. Julkas2's time: 4.0250  #unique: 400000 #total: 10000000
  35.   Akira's time: 4.1030  #unique: 400000 #total: 10000000
  36.  Howard's time: 5.3190  #unique: 400000 #total: 10000000
  37.    Avk1's time: 3.6040  #unique: 400000 #total: 10000000
  38.    Avk2's time: 0.5770  #unique: 400000 #total: 10000000
  39.   440bx's time: 2.6050  #unique: 400000 #total: 10000000
  40.  BrunoK's time: 2.4490  #unique: 400000 #total: 10000000
  41.  
  42. RandomRange = 5
  43. Julkas1's time: 4.0560  #unique: 500000 #total: 10000000
  44. Julkas2's time: 4.0410  #unique: 500000 #total: 10000000
  45.   Akira's time: 4.3050  #unique: 500000 #total: 10000000
  46.  Howard's time: 5.4130  #unique: 500000 #total: 10000000
  47.    Avk1's time: 3.6820  #unique: 500000 #total: 10000000
  48.    Avk2's time: 0.6240  #unique: 500000 #total: 10000000
  49.   440bx's time: 2.6520  #unique: 500000 #total: 10000000
  50.  BrunoK's time: 2.4960  #unique: 500000 #total: 10000000
  51.  
  52. RandomRange = 6
  53. Julkas1's time: 4.0870  #unique: 600000 #total: 10000000
  54. Julkas2's time: 4.0560  #unique: 600000 #total: 10000000
  55.   Akira's time: 4.3680  #unique: 600000 #total: 10000000
  56.  Howard's time: 5.4910  #unique: 600000 #total: 10000000
  57.    Avk1's time: 3.7910  #unique: 600000 #total: 10000000
  58.    Avk2's time: 0.6710  #unique: 600000 #total: 10000000
  59.   440bx's time: 2.7300  #unique: 600000 #total: 10000000
  60.  BrunoK's time: 2.5290  #unique: 600000 #total: 10000000
  61.  
  62. RandomRange = 7
  63. Julkas1's time: 4.1340  #unique: 700000 #total: 10000000
  64. Julkas2's time: 4.1030  #unique: 700000 #total: 10000000
  65.   Akira's time: 4.4930  #unique: 700000 #total: 10000000
  66.  Howard's time: 5.5850  #unique: 700000 #total: 10000000
  67.    Avk1's time: 3.8840  #unique: 700000 #total: 10000000
  68.    Avk2's time: 0.7020  #unique: 700000 #total: 10000000
  69.   440bx's time: 2.7770  #unique: 700000 #total: 10000000
  70.  BrunoK's time: 2.5740  #unique: 700000 #total: 10000000
  71.  
  72. RandomRange = 8
  73. Julkas1's time: 4.1340  #unique: 799992 #total: 10000000
  74. Julkas2's time: 4.1500  #unique: 799992 #total: 10000000
  75.   Akira's time: 4.6020  #unique: 799992 #total: 10000000
  76.  Howard's time: 5.6630  #unique: 799992 #total: 10000000
  77.    Avk1's time: 3.9160  #unique: 799992 #total: 10000000
  78.    Avk2's time: 0.7640  #unique: 799992 #total: 10000000
  79.   440bx's time: 2.8240  #unique: 799992 #total: 10000000
  80.  BrunoK's time: 2.6200  #unique: 799992 #total: 10000000
  81.  
  82. RandomRange = 9
  83. Julkas1's time: 4.2430  #unique: 899989 #total: 10000000
  84. Julkas2's time: 4.1970  #unique: 899989 #total: 10000000
  85.   Akira's time: 4.6950  #unique: 899989 #total: 10000000
  86.  Howard's time: 5.6940  #unique: 899989 #total: 10000000
  87.    Avk1's time: 3.9940  #unique: 899989 #total: 10000000
  88.    Avk2's time: 0.7960  #unique: 899989 #total: 10000000
  89.   440bx's time: 2.8860  #unique: 899989 #total: 10000000
  90.  BrunoK's time: 2.6360  #unique: 899989 #total: 10000000
  91.  
  92. RandomRange = 10
  93. Julkas1's time: 4.1970  #unique: 999951 #total: 10000000
  94. Julkas2's time: 4.1960  #unique: 999951 #total: 10000000
  95.   Akira's time: 4.7740  #unique: 999951 #total: 10000000
  96.  Howard's time: 5.7720  #unique: 999951 #total: 10000000
  97.    Avk1's time: 4.0560  #unique: 999951 #total: 10000000
  98.    Avk2's time: 0.8420  #unique: 999951 #total: 10000000
  99.   440bx's time: 2.9490  #unique: 999951 #total: 10000000
  100.  BrunoK's time: 2.6980  #unique: 999951 #total: 10000000
  101.  
  102. repeatMillionsCount = 2
  103. Julkas1's time: 1.0140  #unique: 734214 #total: 2000000
  104. Julkas2's time: 0.9990  #unique: 734214 #total: 2000000
  105.   Akira's time: 1.4660  #unique: 734214 #total: 2000000
  106.  Howard's time: 1.3260  #unique: 734214 #total: 2000000
  107.    Avk1's time: 1.1230  #unique: 734214 #total: 2000000
  108.    Avk2's time: 0.3590  #unique: 734214 #total: 2000000
  109.   440bx's time: 0.6860  #unique: 734214 #total: 2000000
  110.  BrunoK's time: 0.6870  #unique: 734214 #total: 2000000
  111.  
  112. repeatMillionsCount = 4
  113. Julkas1's time: 1.8100  #unique: 794501 #total: 4000000
  114. Julkas2's time: 1.7940  #unique: 794501 #total: 4000000
  115.   Akira's time: 2.3240  #unique: 794501 #total: 4000000
  116.  Howard's time: 2.4180  #unique: 794501 #total: 4000000
  117.    Avk1's time: 1.8410  #unique: 794501 #total: 4000000
  118.    Avk2's time: 0.4680  #unique: 794501 #total: 4000000
  119.   440bx's time: 1.2630  #unique: 794501 #total: 4000000
  120.  BrunoK's time: 1.1550  #unique: 794501 #total: 4000000
  121.  
  122. repeatMillionsCount = 6
  123. Julkas1's time: 2.5900  #unique: 799570 #total: 6000000
  124. Julkas2's time: 2.5580  #unique: 799570 #total: 6000000
  125.   Akira's time: 3.0890  #unique: 799570 #total: 6000000
  126.  Howard's time: 3.4940  #unique: 799570 #total: 6000000
  127.    Avk1's time: 2.5270  #unique: 799570 #total: 6000000
  128.    Avk2's time: 0.5620  #unique: 799570 #total: 6000000
  129.   440bx's time: 1.8100  #unique: 799570 #total: 6000000
  130.  BrunoK's time: 1.6380  #unique: 799570 #total: 6000000
  131.  
  132. repeatMillionsCount = 8
  133. Julkas1's time: 3.3860  #unique: 799965 #total: 8000000
  134. Julkas2's time: 3.3540  #unique: 799965 #total: 8000000
  135.   Akira's time: 3.8530  #unique: 799965 #total: 8000000
  136.  Howard's time: 4.6170  #unique: 799965 #total: 8000000
  137.    Avk1's time: 3.2450  #unique: 799965 #total: 8000000
  138.    Avk2's time: 0.6870  #unique: 799965 #total: 8000000
  139.   440bx's time: 2.3080  #unique: 799965 #total: 8000000
  140.  BrunoK's time: 2.1380  #unique: 799965 #total: 8000000
  141.  
  142. repeatMillionsCount = 10
  143. Julkas1's time: 4.1340  #unique: 799998 #total: 10000000
  144. Julkas2's time: 4.1650  #unique: 799998 #total: 10000000
  145.   Akira's time: 4.6020  #unique: 799998 #total: 10000000
  146.  Howard's time: 5.6630  #unique: 799998 #total: 10000000
  147.    Avk1's time: 3.9150  #unique: 799998 #total: 10000000
  148.    Avk2's time: 0.7650  #unique: 799998 #total: 10000000
  149.   440bx's time: 2.8390  #unique: 799998 #total: 10000000
  150.  BrunoK's time: 2.6050  #unique: 799998 #total: 10000000
  151.  
  152. repeatMillionsCount = 12
  153. Julkas1's time: 4.9290  #unique: 800000 #total: 12000000
  154. Julkas2's time: 4.8990  #unique: 800000 #total: 12000000
  155.   Akira's time: 5.3660  #unique: 800000 #total: 12000000
  156.  Howard's time: 6.7240  #unique: 800000 #total: 12000000
  157.    Avk1's time: 4.8860  #unique: 800000 #total: 12000000
  158.    Avk2's time: 0.9230  #unique: 800000 #total: 12000000
  159.   440bx's time: 3.5090  #unique: 800000 #total: 12000000
  160.  BrunoK's time: 3.1780  #unique: 800000 #total: 12000000
  161.  
  162. repeatMillionsCount = 14
  163. Julkas1's time: 6.0210  #unique: 800000 #total: 14000000
  164. Julkas2's time: 5.7530  #unique: 800000 #total: 14000000
  165.   Akira's time: 6.1250  #unique: 800000 #total: 14000000
  166.  Howard's time: 7.8410  #unique: 800000 #total: 14000000
  167.    Avk1's time: 5.3080  #unique: 800000 #total: 14000000
  168.    Avk2's time: 0.9670  #unique: 800000 #total: 14000000
  169.   440bx's time: 3.8690  #unique: 800000 #total: 14000000
  170.  BrunoK's time: 3.5880  #unique: 800000 #total: 14000000
  171.  
  172. repeatMillionsCount = 16
  173. Julkas1's time: 6.4900  #unique: 800000 #total: 16000000
  174. Julkas2's time: 6.4720  #unique: 800000 #total: 16000000
  175.   Akira's time: 6.9760  #unique: 800000 #total: 16000000
  176.  Howard's time: 8.9740  #unique: 800000 #total: 16000000
  177.    Avk1's time: 6.0190  #unique: 800000 #total: 16000000
  178.    Avk2's time: 1.0460  #unique: 800000 #total: 16000000
  179.   440bx's time: 4.3830  #unique: 800000 #total: 16000000
  180.  BrunoK's time: 4.0730  #unique: 800000 #total: 16000000
  181.  
  182. repeatMillionsCount = 18
  183. Julkas1's time: 8.0010  #unique: 800000 #total: 18000000
  184. Julkas2's time: 7.7100  #unique: 800000 #total: 18000000
  185.   Akira's time: 8.4710  #unique: 800000 #total: 18000000
  186.  Howard's time: 10.9400 #unique: 800000 #total: 18000000
  187.    Avk1's time: 7.3030  #unique: 800000 #total: 18000000
  188.    Avk2's time: 1.1800  #unique: 800000 #total: 18000000
  189.   440bx's time: 4.9460  #unique: 800000 #total: 18000000
  190.  BrunoK's time: 4.7270  #unique: 800000 #total: 18000000
  191.  
  192. repeatMillionsCount = 20
  193. Julkas1's time: 8.2490  #unique: 800000 #total: 20000000
  194. Julkas2's time: 8.3960  #unique: 800000 #total: 20000000
  195.   Akira's time: 8.8410  #unique: 800000 #total: 20000000
  196.  Howard's time: 11.6740 #unique: 800000 #total: 20000000
  197.    Avk1's time: 7.5300  #unique: 800000 #total: 20000000
  198.    Avk2's time: 1.2660  #unique: 800000 #total: 20000000
  199.   440bx's time: 5.5000  #unique: 800000 #total: 20000000
  200.  BrunoK's time: 5.1120  #unique: 800000 #total: 20000000
  201.  
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 10:28:27 am
And this can not be fixed?
It sure can and, Bruno and yourself are using the "fix" I'd have to use, which is, doing string to integer conversion without readln do it for you (which is slow).

I have mixed feelings about using that optimization for this problem.  To keep the code simple and easy to understand, I'd use ntdll's atoi function but, there is no way for calls to atoi to beat an inline implementation that does not even check for overflows.

Both of your algorithms can be made even faster by not using writeln.  I suppose there must be an object (I'm guessing, a TMemoryStream), that would allow writing an entire block of memory (properly formatted beforehand) in one shot instead of a gazillion writeln(s).

IMO, your avk2 algorithm has the best balance between being clean, easy to understand and fast.  That's what a good program is.  To me, that's the winner.



Title: Re: Sorting and Counting
Post by: hnb on August 01, 2019, 10:34:33 am
small update : my assumption and new propositions for rtl-generics was wrong. After tests I can say one : the Akira proposition for rtl-generics is better than my ideas, not much difference but Akira wins. The positive aspect: thanks to this topic I have ideas to update library (not directly related to sorting and counting), but in general with positive influence on performance and new functionalities.
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 12:19:07 pm
...Both of your algorithms can be made even faster by not using writeln.  I suppose there must be an object (I'm guessing, a TMemoryStream), that would allow writing an entire block of memory (properly formatted beforehand) in one shot instead of a gazillion writeln(s)...
IMO what is already there is already too much, all this things move the code farther and farther from correctness, simplicity and portability. But curious.

...
Code: [Select]
  Result := 1 - integer( a=b ) - 2*integer( a<b );
...
But why not
Code: Pascal  [Select][+][-]
  1.   Result := Integer(a > b) - Integer(a < b);
  2.  
?
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 12:29:48 pm
IMO what is already there is already too much, all this things move the code farther and farther from correctness, simplicity and portability. But curious.
I completely agree with that.  I admit to being curious too and, there are a number of optimizations that come to mind but, it really feels they are completely out of place for what should be (and can be) a very simple program.
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 01:36:53 pm
I almost forgot,
@hnb, I don’t know if you are aware of such a problem:
Code: Pascal  [Select][+][-]
  1. function GenTestArray: specialize TArray<Integer>;
  2. const
  3.   TestSize = 200000;
  4. var
  5.   I, J: Integer;
  6. begin
  7.   SetLength(Result, TestSize);
  8.   for I := 0 to Pred(TestSize div 2) do
  9.     Result[I] := I;
  10.   J := 0;
  11.   for I := TestSize div 2 to High(Result) do
  12.     begin
  13.       Result[I] := J;
  14.       Inc(J);
  15.     end;
  16. end;
  17.  
try sorting this array using TArrayHelper.
Title: Re: Sorting and Counting
Post by: MathMan on August 01, 2019, 02:28:20 pm
...
...
Code: Pascal  [Select][+][-]
  1.   Result := 1 - integer( a=b ) - 2*integer( a<b );
  2.  
...
But why not
Code: Pascal  [Select][+][-]
  1.   Result := Integer(a > b) - Integer(a < b);
  2.  
?

Mainly because I didn't thought of it ;-)
Title: Re: Sorting and Counting
Post by: wp on August 01, 2019, 02:52:11 pm
Code: [Select]
  Result := 1 - integer( a=b ) - 2*integer( a<b );
Why not add logarithms to increase the effect of obfuscation.  ;D

In earnest: If only the sign of the result of the compare function is evaluated by the sort, wouldn't it be sufficient to just subtract the values?
Code: Pascal  [Select][+][-]
  1. function ComparePairs(constref L, R: TIntPair): LongInt;
  2. begin
  3.   Result := L.Key - R.Key;
  4. end;
  5.  

Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 03:27:23 pm
@wp

In earnest: If only the sign of the result of the compare function is evaluated by the sort, wouldn't it be sufficient to just subtract the values?
Code: Pascal  [Select][+][-]
  1. function ComparePairs(constref L, R: TIntPair): LongInt;
  2. begin
  3.   Result := L.Key - R.Key;
  4. end;
  5.  
You've just shown a bit of code that, once seen, seems totally obvious and makes one wonder why that isn't the way everyone does it.

Makes me wonder if there is a reason, other than simply not thinking about it, why it isn't normally done that way.  I cannot think of one.
Title: Re: Sorting and Counting
Post by: howardpc on August 01, 2019, 04:03:14 pm
Makes me wonder if there is a reason, other than simply not thinking about it, why it isn't normally done that way.  I cannot think of one.
I think it is normally done that way.
I've seen that very code (or something almost identical) both in this forum (I believe it was in code from Marco) and in the FPC sources.
Title: Re: Sorting and Counting
Post by: MathMan on August 01, 2019, 04:13:24 pm
You've just shown a bit of code that, once seen, seems totally obvious and makes one wonder why that isn't the way everyone does it.

Makes me wonder if there is a reason, other than simply not thinking about it, why it isn't normally done that way.  I cannot think of one.

Hm - in this case, what about range check errors? The comparison is save, but the subtraction is not, or is it?

But yes, if only the sign is required then simple subtraction should be sufficient. However the compare & cast hands back a ternary state, as did the initial comparison function.
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 04:15:08 pm
I think it is normally done that way.
I've seen that very code (or something almost identical) both in this forum (I believe it was in code from Marco) and in the FPC sources.
I've read a lot of code in various languages and, I think it's the first time I see it done that way, because I'd remember it.  Now that I've seen it, I'm not about to forget it.
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 04:18:20 pm
For this particular benchmark, this will work.
But what happens if, for example, L.Key = 1500000000 and R.Key = -1500000005?
Title: Re: Sorting and Counting
Post by: 440bx on August 01, 2019, 05:06:20 pm
@MathMan

But yes, if only the sign is required then simple subtraction should be sufficient.
For a sort compare function only the sign should matter (provided the sort function doesn't compare against hard coded values, -1, 0, 1, which it definitely shouldn't.)

@avk

But what happens if, for example, L.Key = 1500000000 and R.Key = -1500000005?
Yes, you are right.  Those values would cause an overflow which would incorrectly indicate that L is less than R.

Both you, and MathMan are correct, doing comparisons avoids that problem.


Thank you both, for pointing out that problem (which now seems obvious too.)
Title: Re: Sorting and Counting
Post by: BrunoK on August 01, 2019, 05:31:50 pm
My last word supporting Signed SizeInt values.

Code: Pascal  [Select][+][-]
  1.  
  2.   { TSortCountList }
  3. type
  4.   TSortCountList = class(TFPList)
  5.   public
  6.     procedure QuickSort;
  7.   end;
  8.  
  9. procedure SortCountBrunoK1;  { Note : requires Classes }
  10. const
  11.   cCR = $0D;
  12.   cETX = $03;
  13.   c0 = Ord('0');
  14.   function LoadStreamToList(aMemStream: TMemoryStream; aList: TFPList): integer;
  15.   var
  16.     { Parse lines }
  17.     lPByte, lPEndByte: PByte;
  18.  
  19.     { Values extraction }
  20.     lValueStarted: boolean = False;
  21.     lSizeInt: SizeInt = 0;
  22.     lMulSign: integer = 1;
  23.     lCntRec: integer;
  24.   begin
  25.     { Prepare aList }
  26.     lCntRec := aMemStream.Size;
  27.     if lCntRec <= 0 then // Stream empty ?
  28.       exit(0);
  29.     aList.Count := lCntRec div 10; // Setup approximative size
  30.     aList.Count := 0;
  31.  
  32.     lPByte := PByte(aMemStream.memory);
  33.     lPEndByte := lPByte + aMemStream.Size;
  34.     (lPEndByte -1)^ := cETX;
  35.     while lPByte <= lPEndByte do begin
  36.       if (lPByte^ <= cCR) then begin
  37.         if lValueStarted then begin
  38.           aList.Add(Pointer(lMulSign * lSizeInt));
  39.           lSizeInt := 0;
  40.           lMulSign := 1;
  41.           lValueStarted := False;
  42.         end;
  43.       end
  44.       else begin
  45.         if lPByte^ in [Ord('+'),Ord('-')] then begin
  46.           if lPByte^ = Ord('-') then
  47.             lMulSign := lMulSign * -1;
  48.         end
  49.         else begin
  50.           lValueStarted := True;
  51.           lSizeInt := lSizeInt * 10 + (lPByte^ - c0);
  52.         end;
  53.       end;
  54.       Inc(lPByte);
  55.     end;
  56.     Result := aList.Count;
  57.   end;
  58. var
  59.   lFile: TextFile;
  60.   lMemStream: TMemoryStream;
  61.   lNbRecs: integer = 0;
  62.   lSortCountList: TSortCountList;
  63.   lIx: integer;
  64.   lLastValue: pointer;
  65.   lListCount: integer;
  66.   lLastValueCount: integer;
  67.   lWriteTextLn : shortstring;
  68. begin
  69.   routineName := 'SortCountBrunoK1'; // {$I %currentroutine%};
  70.   lMemStream := TMemoryStream.Create;
  71.   lMemStream.LoadFromFile(inFileName);
  72.   lSortCountList := TSortCountList.Create;
  73.   lNbRecs := LoadStreamToList(lMemStream, lSortCountList);
  74.   lMemStream.Free; // Not needed anymore
  75.   if lNbRecs > 0 then begin
  76.     AssignFile(lFile, outFilename);
  77.     Rewrite(lFile);
  78.     lSortCountList.QuickSort;
  79.     lIx := 0;
  80.     lLastValue := lSortCountList[lIx];
  81.     lLastValueCount := 1;
  82.     lListCount := lSortCountList.Count;
  83.     lSortCountList.Add(nil);
  84.     repeat
  85.       Inc(lIx);
  86.       if (lSortCountList[lIx] <> lLastValue) then begin
  87.         Inc(unique);
  88.         WriteLn(lFile, UINTPTR(lLastValue), ' - ', lLastValueCount);
  89.         if (lIx >= lListCount) then
  90.           Break;
  91.         lLastValue := lSortCountList[lIx];
  92.         lLastValueCount := 1;
  93.       end
  94.       else
  95.         Inc(lLastValueCount);
  96.     until False;
  97.     CloseFile(lFile);
  98.     Total := lIx;
  99.   end;
  100.   lSortCountList.Free;
  101. end;
  102.  
  103. { TSortCountList }
  104.  
  105. type
  106.   PSizeIntList = ^TSizeIntList;
  107.   TSizeIntList = array[0..MaxListSize - 1] of SizeInt;
  108.  
  109. procedure TSortCountList.QuickSort;
  110. var
  111.   _list: PSizeIntList;
  112.   procedure _QSort(L, R: integer);
  113.   var
  114.     I, J: integer;
  115.     P, Q: SizeInt;
  116.   begin
  117.     repeat
  118.       I := L;
  119.       J := R;
  120.       P := SizeInt(_list^[(L + R) div 2]);
  121.       repeat
  122.         while SizeInt(_list^[i]) < P do
  123.           I := I + 1;
  124.         while P < SizeInt(_list^[J]) do
  125.           J := J - 1;
  126.         if I <= J then begin
  127.           Q := _list^[I];
  128.           _list^[I] := _list^[J];
  129.           _list^[J] := Q;
  130.           I := I + 1;
  131.           J := J - 1;
  132.         end;
  133.       until I > J;
  134.       // sort the smaller range recursively
  135.       // sort the bigger range via the loop
  136.       // Reasons: memory usage is O(log(n)) instead of O(n) and loop is faster than recursion
  137.       if J - L < R - I then begin
  138.         if L < J then
  139.           _QSort(L, J);
  140.         L := I;
  141.       end
  142.       else begin
  143.         if I < R then
  144.           _QSort(I, R);
  145.         R := J;
  146.       end;
  147.     until L >= R;
  148.   end;
  149. begin
  150.   if not Assigned(List) or (Count < 2) then exit;
  151.   _List := PSizeIntList(List);
  152.   _QSort(0, Count - 1);
  153. end;
  154.  
Title: Re: Sorting and Counting
Post by: avk on August 01, 2019, 06:09:12 pm
@BrunoK, do you want the new version to be added to the benchmark and runned?
Title: Re: Sorting and Counting
Post by: mpknap on November 09, 2019, 08:10:04 am
Gentlemen, does anyone know how to write it in Python?
Title: Re: Sorting and Counting
Post by: bytebites on November 09, 2019, 09:46:44 am
Without sorting
Code: Python  [Select][+][-]
  1. from collections import defaultdict
  2. d=defaultdict(int)
  3. with open("infile") as f:
  4.   for s in f:
  5.     k=s.strip(chr(10))
  6.     if k:
  7.       d[k]+=1
  8. with open("outfile","w") as o:
  9.   for (k,v) in d.items():
  10.     print(f'{k} - {v}',file=o)    
Title: Re: Sorting and Counting
Post by: jamie on November 09, 2019, 03:36:14 pm
Ugg , looks too much like BASIC.....
 >:(
Title: Re: Sorting and Counting
Post by: julkas on November 09, 2019, 03:46:39 pm
Ugg , looks too much like BASIC.....
 >:(
BASIC rocks.
BASIC is great.
BASIC forever.
Title: Re: Sorting and Counting
Post by: mpknap on November 09, 2019, 09:05:41 pm
Ugg , looks too much like BASIC.....
 >:(

but it works :)
only sorting necessary for me ....
Title: Re: Sorting and Counting
Post by: 440bx on November 09, 2019, 10:02:56 pm
but it works :)
Just curiosity, it would be interesting to see how the performance of that Python implementation compares with the various Pascal implementations.
Title: Re: Sorting and Counting
Post by: avk on November 10, 2019, 10:18:55 am
You only have to wish, sir. :)
Python code:
Code: Python  [Select][+][-]
  1. from collections import defaultdict
  2. import timeit
  3. total=0
  4. d=defaultdict(int)
  5. stime=timeit.default_timer()
  6. with open("infile") as f:
  7.   for s in f:
  8.     total +=1
  9.     k=s.strip(chr(10))
  10.     if k:
  11.       d[k]+=1
  12. with open("outfile","w") as o:
  13.   for k in sorted(d.keys()):
  14.     print(f'{k} - {d[k]}',file=o)
  15. stime=timeit.default_timer()-stime
  16. print('Time elapsed: ', stime, ', #unique: ', len(d), ', #total: ', total)
  17.  
Output:
Code: Text  [Select][+][-]
  1. Time elapsed:  12.047126958000263 , #unique:  999955 , #total:  10000000
  2.  

Pascal code:
Code: Pascal  [Select][+][-]
  1. program sort_count;
  2. {$mode objfpc}
  3. {$MODESWITCH NESTEDPROCVARS}
  4. uses
  5.   SysUtils, DateUtils,
  6.   LGUtils, LGHashMultiSet, LGHelpers;
  7. procedure SortCount;
  8. type
  9.   TCounter  = specialize TGHashMultiSetLP<Integer>;
  10.   TCountRef = specialize TGAutoRef<TCounter>;
  11.   TEntry    = TCounter.TEntry;
  12.   function EntryCmp(constref L, R: TEntry): SizeInt;
  13.   begin Result := Integer.Compare(L.Key, R.Key); end;
  14. var
  15.   CountRef: TCountRef;
  16.   InOut: Text;
  17.   Counter: TCounter;
  18.   e: TEntry;
  19.   I: Integer;
  20.   stime: TTime;
  21. begin
  22.   Counter := CountRef;
  23.   Assign(InOut, 'infile');
  24.   Reset(InOut);
  25.   stime := Time;
  26.   while not EOF(InOut) do
  27.     begin
  28.       ReadLn(InOut, I);
  29.       Counter.Add(I);
  30.     end;
  31.   Close(InOut);
  32.   if Counter.NonEmpty then
  33.     begin
  34.       Assign(InOut, 'outfile');
  35.       Rewrite(InOut);
  36.       for e in Counter.Entries.Sorted(@EntryCmp) do
  37.         WriteLn(InOut, e.Key, ' - ', e.Count);
  38.       Close(InOut);
  39.     end;
  40.   WriteLn('Time elapsed: ', MillisecondsBetween(Time, stime)/1000:0:4,
  41.           ', #unique: ', Counter.EntryCount, ', #total: ', Counter.Count);  
  42. end;
  43. begin
  44.   SortCount;
  45. end.
  46.  
Output:
Code: Text  [Select][+][-]
  1. Time elapsed: 3.2300, #unique: 999955, #total: 10000000
  2.  
Title: Re: Sorting and Counting
Post by: julkas on November 10, 2019, 10:38:42 am
but it works :)
Just curiosity, it would be interesting to see how the performance of that Python implementation compares with the various Pascal implementations.
Python sort implementation is exelent. It's based on Team sort algo.
https://stackoverflow.com/questions/1517347/about-pythons-built-in-sort-method
@avk Can you compare only sorting phase time?. (Python default I/O is slow).
Try with PyPy compiler also.
 
Title: Re: Sorting and Counting
Post by: 440bx on November 10, 2019, 10:57:36 am
You only have to wish, sir. :)
Thank you very much Avk. :)
Title: Re: Sorting and Counting
Post by: avk on November 10, 2019, 11:12:46 am
Can you compare only sorting phase time?. (Python default I/O is slow).
Sort phase only:
Python
Code: Text  [Select][+][-]
  1. Time elapsed:  1.1443881560007867 , #unique:  999955 , #total:  10000000
  2.  
Pascal
Code: Text  [Select][+][-]
  1. Time elapsed: 0.1750, #unique: 999955, #total: 10000000
  2.  
Title: Re: Sorting and Counting
Post by: julkas on November 10, 2019, 11:34:04 am
Can you compare only sorting phase time?. (Python default I/O is slow).
Sort phase only:
Python
Code: Text  [Select][+][-]
  1. Time elapsed:  1.1443881560007867 , #unique:  999955 , #total:  10000000
  2.  
Pascal
Code: Text  [Select][+][-]
  1. Time elapsed: 0.1750, #unique: 999955, #total: 10000000
  2.  
@avk Thanks.
https://ideone.com/9n168K
Title: Re: Sorting and Counting
Post by: Thaddy on November 10, 2019, 12:20:34 pm
@avk Thanks.
https://ideone.com/9n168K
Julkas, ideone lags behind in compiler version for FPC and that can make a big difference. You should run the complete test suite on ideone to get any meaningful comparison.
Better run 3.2.0  or test everything with 3.0.4.
Title: Re: Sorting and Counting
Post by: julkas on November 10, 2019, 06:11:24 pm
We can't compare Python (interpreted Lang) with Pascal.
Anyway, I use Python, I like Python.
Title: Re: Sorting and Counting
Post by: Thaddy on November 10, 2019, 06:17:20 pm
We can't compare Python (interpreted Lang) with Pascal.
Yes, we can since Python relies so heavily on library code compiled in other languages. Python is just glue. Much more so than pure scripting.
(You can also use FPC to write and add Python libraries)
And indeed:
I use Python and I like Python... :P :o
Title: Re: Sorting and Counting
Post by: julkas on November 10, 2019, 07:19:42 pm
We can't compare Python (interpreted Lang) with Pascal.
Yes, we can since Python relies so heavily on library code compiled in other languages. Python is just glue. Much more so than pure scripting.
(You can also use FPC to write and add Python libraries)
And indeed:
I use Python and I like Python... :P :o
PYTHON rocks.
PYTHON is great.
PYTHON forever.
PYTHON νῦν και ἀεὶ.
Title: Re: Sorting and Counting
Post by: Thaddy on November 10, 2019, 07:29:25 pm
Well, as it seems it needs some FPC compiled sort libraries,,,,, 8-)
Title: Re: Sorting and Counting
Post by: mpknap on November 10, 2019, 08:11:23 pm
Ultimately, I use this algorithm, it is clear to me and because I have a problem installing the LG package ....
30MB file counted and sorted is 6 minutes;)

I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use less resources.
Code: Pascal  [Select][+][-]
  1. unit mainSortCount;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.    Classes, SysUtils, Forms, StdCtrls;
  9.  
  10. type
  11.    TForm1 = class(TForm)
  12.      Memo1: TMemo;
  13.      procedure FormCreate(Sender: TObject);
  14.    end;
  15.  
  16. var
  17.    Form1: TForm1;
  18.  
  19.   procedure SortCount(const anInFile: String; out aList: TStringList);
  20.  
  21.   procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  22.  
  23. implementation
  24.  
  25. {$R *.lfm}
  26.  
  27. { TForm1 }
  28.  
  29. procedure SortCount(const anInFile: String; out aList: TStringList);
  30. const
  31.   one = PtrUInt(1);
  32. var
  33.   textf: TextFile;
  34.   s: String;
  35.   idx: Integer;
  36.  
  37.   function GetSuccObj(anIntObj: TObject): TObject;
  38.   var
  39.     i: PtrUInt absolute anIntObj;
  40.   begin
  41.     Inc(i);
  42.     Exit(anIntObj);
  43.   end;
  44.  
  45. begin
  46.   Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  47.   aList := TStringList.Create;
  48.   aList.Duplicates := dupError;
  49.   aList.Sorted := True;
  50.   AssignFile(textf, anInFile);
  51.   try
  52.     Reset(textf);
  53.     while not EOF(textf) do
  54.       begin
  55.         ReadLn(textf, s);
  56.         s := Trim(s);
  57.         idx := aList.IndexOf(s);
  58.         case idx of
  59.           -1: aList.AddObject(s, TObject(one));
  60.           else
  61.             aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
  62.         end;
  63.       end;
  64.   finally
  65.     CloseFile(textf);
  66.   end;
  67. end;
  68.  
  69. procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  70. var
  71.   i: Integer;
  72. begin
  73.   if Assigned(aList) and Assigned(aMemo) then
  74.     begin
  75.       aMemo.Clear;
  76.       for i := 0 to aList.Count-1 do
  77.         aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
  78.     end;
  79. end;
  80.  
  81. procedure TForm1.FormCreate(Sender: TObject);
  82. var
  83.   sl: TStringList;
  84. begin
  85.   SortCount('infile.txt', sl);
  86.   try
  87.     ShowListInMemo(sl, Memo1);
  88.     Memo1.Lines.SaveToFile('outfile.txt');
  89.   finally
  90.     sl.Free;
  91.   end;
  92. end;
  93.  
  94. end.


As for Python, that was just curiosity, but thank you.

In general, the problem is a little different. Unixtime is one variable of a certain record. It's about displaying records and their numbers with the same Unixtime. After sorting, I don't know how to refer to the other variables in the record, but I don't want to ask for more :)


Well, unless it's interesting for you;)
Title: Re: Sorting and Counting
Post by: 440bx on November 10, 2019, 08:20:42 pm
After sorting, I don't know how to refer to the other variables in the record, but I don't want to ask for more :)
Question for you, do you want your program to run on multiple platforms (e.g, Linux, Windows, other) or is Windows only acceptable ?

ETA: if Windows-only is acceptable then, another question: are the records in the file fixed length or variable length ? 
Title: Re: Sorting and Counting
Post by: avk on November 11, 2019, 06:13:33 am
... I have a problem installing the LG package ....

What kind of problem?
Title: Re: Sorting and Counting
Post by: Thaddy on November 11, 2019, 08:54:20 am
What kind of install? Just make sure it is in your path.
Title: Re: Sorting and Counting
Post by: mpknap on November 11, 2019, 09:28:59 am
After sorting, I don't know how to refer to the other variables in the record, but I don't want to ask for more :)
Question for you, do you want your program to run on multiple platforms (e.g, Linux, Windows, other) or is Windows only acceptable ?

ETA: if Windows-only is acceptable then, another question: are the records in the file fixed length or variable length ?

I work in Windows 10.
At the beginning I asked about sorting Unix times alone, because I thought I could do it myself. Therefore, as an example I gave a different TXT file format.
The base is originally TXT in the format:


1570485826087           - UnixTime
0,0                                -lat
0,0                                -lon
Patrycja Maca┼éa         -user Name
16584                           -User Number
ekonomik1                    -Team Name
1581                             -Team Number
                                     - Free line as separator
1570485826087
0,0
0,0
Gargastw├│r
17943
kosmos
1243

1570485909840
41,6548226860975
12,523491502443182
Emanuele Maria Latorre
4125
Divulgazione Libera
829

1570485929612
50,63142735
19,63182841
Przemek
137
Przemek
148

1570485941031
42,30451192
-71,22277853
Asaf
15190
Israel
2610


Sample file in the attachment, the smallest due to forum restrictions.

These are the times of detection received by users' smartphones in the CREDOSCIENCE project.

I am interested in obtaining such information :
- how many detections are in the same second / minute / hour. (I already got it thanks to your help)
- If there are several in the same second / minute / hour, display which users and their other data as number, coordinates, Team name etc ..

As I wrote, this is not my job, I'm not a programmer, it's fun for me in my free time. Thank you for your interest :)


What kind of install? Just make sure it is in your path.
I will try as you write. I installed the Lgenerics.LPK file, there was an error and I gave up.

Title: Re: Sorting and Counting
Post by: 440bx on November 11, 2019, 03:10:10 pm

1570485826087           - UnixTime
0,0                     -lat
0,0                     -lon
Patrycja Maca┼éa        -user Name
16584                   -User Number
ekonomik1               -Team Name
1581                    -Team Number
                        - Free line as separator
1570485826087
0,0
0,0
Gargastw├│r
17943
kosmos
1243

<many more>

1567365497795
0,0
0,0
J.C.K.
6037

1

1567367488139
0,0
0,0
J.C.K.
6037

1

Just to make sure I understand the structure of your file and its records.  It looks like every record in the file consists of seven (7) fields (one per line) and that, sometimes, a field may be blank.  I want to confirm that every record is always 7 fields (though one or more, except the unixtime, may be blank), is this correct ? 


I am interested in obtaining such information :
- how many detections are in the same second / minute / hour. (I already got it thanks to your help)
Mostly with the help of the other contributors in this thread since my implementation doesn't even use the appropriate collating sequence.

- If there are several in the same second / minute / hour, display which users and their other data as number, coordinates, Team name etc ..
Piece of pie.  I got several things to do today so I won't commit to a solid timeframe but, I'll give you something soon.

Title: Re: Sorting and Counting
Post by: mpknap on November 11, 2019, 03:42:51 pm


[/font][/size]
Just to make sure I understand the structure of your file and its records.  It looks like every record in the file consists of seven (7) fields (one per line) and that, sometimes, a field may be blank.  I want to confirm that every record is always 7 fields (though one or more, except the unixtime, may be blank), is this correct ? 

Yes. One record is 7 lines. The eighth line is the separator between records.

 

Piece of pie.  I got several things to do today so I won't commit to a solid timeframe but, I'll give you something soon.
Thanks .
Title: Re: Sorting and Counting
Post by: mpknap on November 11, 2019, 08:32:44 pm
I will tell you what it is needed for.
Detections in smartphones are the impact of cosmic ray particles on the phone's camera (photons, electrons, muons). They are mainly single, not regular impacts on the camera matrix.
Scientists, creators of the CREDO project, are looking for the so-called showers, i.e. rain, hitting many phones at one time, or a short time interval.

Scientists are also looking for links between detection frequencies and other phenomena, with an earthquake, with medicine, and solar wind. I also want to put these data on top of each other. maybe someday I will compare with the results of scientists, maybe I will be the first, maybe not, but I will definitely not be bored with such tasks :)
I "play" such searches on my own.

Anyone can join the project. Data are widely available. They mainly operate on Python.

Currently, the biggest problem is the elimination of false detections. They are mainly committed by new users or those who want to be better in the ranking. For this you need specialists in image analysis.

I don't want you to think that I'm using you, what I do is on my own, for my own curiosity.
Title: Re: Sorting and Counting
Post by: 440bx on November 12, 2019, 02:23:06 am
I don't want you to think that I'm using you, what I do is on my own, for my own curiosity.
Don't worry about that. 

Detections in smartphones are the impact of cosmic ray particles on the phone's camera (photons, electrons, muons).
<snip>
Scientists are also looking for links between detection frequencies and other phenomena,
Experiments like that have the potential to yield surprising results.  Many scientific discoveries resulted from unexpected side effects.  Some that come to mind are microwave ovens, the scientists were constantly having headaches, that's how they figured out that the microwaves were frying their brains (not kidding... though, saying "frying" is a bit of an exaggeration.)

One my favorite unexpected side effects is related to the smartphones you mention.  You may have noticed that cell phones/smart phones no longer have an antenna that has to be pulled out of the phone in order to get reception (old cell phones used to have an antenna.) The reason there _seems_ to be no antenna goes back all the way to Georg Cantor, a mathematician who studied infinities, his work influenced Benoit Mandelbrot, who defined and studied the behavior of fractals, subsequently it was found, apparently mostly by accident, that fractal antennas had a significantly greater wave reception spectrum (due, in part, to their being partially dimensional.)

Without fractals antennas, smartphones would need about half a dozen different antennas protruding out of them in order to capture the spectrum of waves they need to implement all their nifty functions.

I didn't have any time to work of the program today but... I haven't forgotten.
Title: Re: Sorting and Counting
Post by: avk on November 12, 2019, 05:56:56 am
@mpknap, did you mean something like that?
Title: Re: Sorting and Counting
Post by: mpknap on November 12, 2019, 07:09:00 am
@mpknap, did you mean something like that?
that's exactly how I would see it. I have to test. I want to transfer this data to Google map.
Title: Re: Sorting and Counting
Post by: avk on November 12, 2019, 07:27:12 am
What is essential:
Your test file contained a BOM, I just deleted it manually. Of course, you can load the data file in another way.
The project uses TVirtualStringTree, you must make sure that you have the virtualtreeview package installed.
And of course, the project uses LGenerics. :)
Title: Re: Sorting and Counting
Post by: 440bx on November 12, 2019, 04:45:14 pm
@mpknap

Avk's program is about as good as it gets.  I guess you're all set.

@avk

Nice!  thank you.
Title: Re: Sorting and Counting
Post by: avk on November 12, 2019, 05:50:14 pm
In fact, everything is not as good as you say. I just accidentally discovered that the proposed solution cannot display the contents of the last node of the treeview. So I propose a fixed solution.
Title: Re: Sorting and Counting
Post by: 440bx on November 12, 2019, 06:23:44 pm
In fact, everything is not as good as you say.
but, you proved again that errare humanum est. ;)
Title: Re: Sorting and Counting
Post by: mpknap on November 13, 2019, 06:20:20 am
I installed LCLEXTENSION and VIRTUALTREEVIEW Package, but when installing LGeneric I have an error,
Title: Re: Sorting and Counting
Post by: avk on November 13, 2019, 09:59:04 am
According to your screenshot, it is impossible to determine the version of the compiler used, but I suspect it is 3.0.4.
Quote from the LGenerics readme:
Quote
...In order to use it (FPC 3.3.1 and higher and Lazarus 1.9.0 and higher)...
If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.
Title: Re: Sorting and Counting
Post by: mpknap on November 13, 2019, 06:31:29 pm
Quote
If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.
If you could, it would be nice. This will be open code, and I don't want to oblige other interested parties to install Packages. It's supposed to be the easiest, not necessarily fast.
Title: Re: Sorting and Counting
Post by: avk on November 13, 2019, 07:23:40 pm
Ok, basically we only need sorting and binary search algorithms.
A fairly good sorting algorithm is available in fcl-stl. But I had to write BinarySearch from scratch,
I hope that it will work correctly. Let me know if anything goes wrong.

Upd. I forgot to remove the dependency on LGenerics from the project. :-[
I replaced the attachment.
Title: Re: Sorting and Counting
Post by: Thaddy on November 13, 2019, 08:05:38 pm
@avk there's a good search at http://www.martincharvey.net I even have a fpc adaptation if you are interested. But you can also add {$mode delphi} ... :D ;)
I am referring to his binarytree.pas and indexedstore.pas. The latter is the conceptually most interesting.
Title: Re: Sorting and Counting
Post by: howardpc on November 13, 2019, 08:21:10 pm
@avk
If you intended your second sort-count.zip to avoid dependency on LGenerics, did you upload the wrong file?
Title: Re: Sorting and Counting
Post by: avk on November 13, 2019, 08:31:03 pm
@Thaddy, thank you, interesting.
@ howardpc, no, I forgot to remove the dependency, thank you very much.
Title: Re: Sorting and Counting
Post by: 440bx on November 13, 2019, 09:22:19 pm
Ok, basically we only need sorting and binary search algorithms.
A fairly good sorting algorithm is available in fcl-stl. But I had to write BinarySearch from scratch,
I hope that it will work correctly. Let me know if anything goes wrong.
Just in case you may be interested and, in addition to what Thaddy above mentioned, under Windows, ntdll provides a typical bsearch function to search a sorted sequence. 

I use the following definitions:
Code: Pascal  [Select][+][-]
  1. // -----------------------------------------------------------------------------
  2. // qsort and bsearch related types
  3.  
  4. type
  5.   TCompareFunction = function (key : pointer; data : pointer) : ptrint; cdecl;
  6.  
  7. const
  8.   COMPARE_EQUAL   =  0;
  9.   COMPARE_GREATER =  1;
  10.   COMPARE_LESS    = -1;
  11.  
  12.  
  13. function bsearch(key             : pointer;
  14.                  base            : pointer;
  15.                  num             : ptruint;
  16.                  width           : ptruint;
  17.                  CompareFunction : TCompareFunction) : pointer;
  18.   cdecl; external ntdll;
  19.   // ; void *__cdecl bsearch(const void *Key,
  20.   //                         const void *Base,
  21.   //                             size_t NumOfElements,
  22.   //                             size_t SizeOfElements,
  23.   //                               int (__cdecl *PtFuncCompare)(const void *,
  24.   //                                                            const void *))
  25.  
  26. procedure qsort(base            : pointer;
  27.                 num             : ptruint;
  28.                 width           : ptruint;
  29.                 CompareFunction : TCompareFunction);
  30.   cdecl; external ntdll;
  31.   // ; void __cdecl qsort(void  *Base,
  32.   //                      size_t NumOfElements,
  33.   //                      size_t SizeOfElements,
  34.   //                         int (__cdecl *PtFuncCompare)(const void *,
  35.   //                                                      const void *))
  36.  
  37.  
  38.  
so far, it seems to be bug free ;)
Title: Re: Sorting and Counting
Post by: avk on November 14, 2019, 05:02:06 am
@440bx, thanks. However, there is a significant point. In our case, we need a binary search that, in the case of duplicate values, returns the position of the leftmost one.
Title: Re: Sorting and Counting
Post by: mpknap on November 14, 2019, 07:44:47 am
AVK, I'm sorry but I still have problem with starting sort_count.lpk. He refers to LGeneric, and he can't find RTTI. I even installed the latest version of Lazarus 2.0.6 FPC 3.0.4. for windows 10/64.

Thank you anyway.
Title: Re: Sorting and Counting
Post by: Thaddy on November 14, 2019, 07:48:23 am
AVK, I'm sorry but I still have problem with starting sort_count.lpk. He refers to LGeneric, and he can't find RTTI. I even installed the latest version of Lazarus 2.0.6 FPC 3.0.4. for windows 10/64.

Thank you anyway.
lgenerics needs 3.2.0 or trunk. Not 3.0.4. avk already wrote that.
The rtti unit is also introduced in 3.2.0. See https://wiki.freepascal.org/FPC_New_Features_3.2#Rtti_unit
I suggest you install 3.2.0. It is stable, feature complete  and for major platforms there are ready builds.
And fpcdeluxe can build and install fpc3.2.0+Lazarus 2.0.6 for you.
Title: Re: Sorting and Counting
Post by: avk on November 14, 2019, 08:31:18 am
@mpknap, please test this version.
Title: Re: Sorting and Counting
Post by: Thaddy on November 14, 2019, 10:32:58 am
@mpknap, please test this version.
Back-porting just before a major release?  :D :D :D :D
But lgenerics is excellent and a good addition, even replacement, to rtl-generics because it covers a wider scope.
It is not for beginners, though, but you know that.
Title: Re: Sorting and Counting
Post by: 440bx on November 14, 2019, 11:31:44 am
However, there is a significant point. In our case, we need a binary search that, in the case of duplicate values, returns the position of the leftmost one.
Yes, that is definitely an important difference in this case. In such a case, when using bsearch, the resulting index when a match is found, has to be "manually adjusted" to ensure it is the index of the first instance match.



Title: Re: Sorting and Counting
Post by: avk on November 14, 2019, 12:24:39 pm
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?
Title: Re: Sorting and Counting
Post by: 440bx on November 14, 2019, 01:15:19 pm
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?
It normally wouldn't be O(n), it would be (lg N) + (avg_dups_per_key/2).

Where avg_dups is the average number of duplicates per distinct element in the table.  Obviously, if that average is large for a large number of unixtimes (in this case) then it will likely be better from a performance viewpoint to create an index of distinct keys (unixtime) and search that index instead.  IOW, as the ratio of N/distinct gets larger, performance suffers.  In the worst case, for 1 element duplicated N times, then it would be O(N).


Title: Re: Sorting and Counting
Post by: avk on November 14, 2019, 02:20:05 pm
...In the worst case, for 1 element duplicated N times, then it would be O(N).
yes it is, and for N/2, N/4, ... duplicated elements.
Title: Re: Sorting and Counting
Post by: 440bx on November 14, 2019, 03:31:21 pm
...In the worst case, for 1 element duplicated N times, then it would be O(N).
yes it is, and for N/2, N/4, ... duplicated elements.
Yes but, it depends on how the duplicates are distributed.  For instance, consider N elements and all the duplicates are clustered in, just for example, 4 elements.  Accessing any of those 4 elements is basically O(n), while accessing any other element is O(lg N).  For such a distribution (granted, it is an unusual one), the big O of the totality is neither O(n) nor O(lg N), for this example it would be (2n + (n - 4) * lg n)

In the specific case of tracking times across the world for what may be a fairly common event, the safe approach is to create an index of distinct unixtimes and bsearch that.   No undesirable cases that way.

Anyway, your point that the start of the duplicate list isn't returned by a normal bsearch is definitely valid.  I just wanted to point out that, if the duplicates are evenly distributed across the n elements _and_ the ratio of N/duplicates isn't very large (say under 10) then, giving the traditional bsearch "a hand" to find the start of the list is reasonable and, ntdll provides one. :)




Title: Re: Sorting and Counting
Post by: mpknap on November 14, 2019, 08:37:57 pm
@mpknap, please test this version.
Yes Yes Yes!!! It works! And that was what I meant, this kind of sorting and displaying information.

Although there is a small problem, pressing F12 to enter the form shows (attachment 1 fot1.jpg), and after attempting installation shows attachment fot2.jpg.

In any case, the algorithm is ok. At the weekend I will check it.
Thank you once again.
Title: Re: Sorting and Counting
Post by: mpknap on November 15, 2019, 07:47:30 am
I don't understand why you can't install VirtualtreeView, I try in different ways and every time different errors. Even in the latest version of lazarus, OnlinePackageManager does not help.

I can't edit the form without it.

AVK, you can't write it in standard Lazarus packages? ;)

If not, I give up ...
Title: Re: Sorting and Counting
Post by: avk on November 15, 2019, 08:15:11 am
I am glad to see some progress in your efforts, but I do not understand your failure to install VirtualTreeview. I installed VirtualTreeview from /lazarus/components/virtualtreeview. Compilation and installation are performed without any errors.
Title: Re: Sorting and Counting
Post by: mpknap on November 16, 2019, 08:17:02 am
finally, I was able to install. I had two other versions of Lazarus on the disk, I deleted them badly.
everything works, thank you again!
Title: Re: Sorting and Counting
Post by: mpknap on July 01, 2020, 07:38:51 am
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use less resources.
Code: Pascal  [Select][+][-]
  1. unit mainSortCount;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.    Classes, SysUtils, Forms, StdCtrls;
  9.  
  10. type
  11.    TForm1 = class(TForm)
  12.      Memo1: TMemo;
  13.      procedure FormCreate(Sender: TObject);
  14.    end;
  15.  
  16. var
  17.    Form1: TForm1;
  18.  
  19.   procedure SortCount(const anInFile: String; out aList: TStringList);
  20.  
  21.   procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  22.  
  23. implementation
  24.  
  25. {$R *.lfm}
  26.  
  27. { TForm1 }
  28.  
  29. procedure SortCount(const anInFile: String; out aList: TStringList);
  30. const
  31.   one = PtrUInt(1);
  32. var
  33.   textf: TextFile;
  34.   s: String;
  35.   idx: Integer;
  36.  
  37.   function GetSuccObj(anIntObj: TObject): TObject;
  38.   var
  39.     i: PtrUInt absolute anIntObj;
  40.   begin
  41.     Inc(i);
  42.     Exit(anIntObj);
  43.   end;
  44.  
  45. begin
  46.   Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  47.   aList := TStringList.Create;
  48.   aList.Duplicates := dupError;
  49.   aList.Sorted := True;
  50.   AssignFile(textf, anInFile);
  51.   try
  52.     Reset(textf);
  53.     while not EOF(textf) do
  54.       begin
  55.         ReadLn(textf, s);
  56.         s := Trim(s);
  57.         idx := aList.IndexOf(s);
  58.         case idx of
  59.           -1: aList.AddObject(s, TObject(one));
  60.           else
  61.             aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
  62.         end;
  63.       end;
  64.   finally
  65.     CloseFile(textf);
  66.   end;
  67. end;
  68.  
  69. procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  70. var
  71.   i: Integer;
  72. begin
  73.   if Assigned(aList) and Assigned(aMemo) then
  74.     begin
  75.       aMemo.Clear;
  76.       for i := 0 to aList.Count-1 do
  77.         aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
  78.     end;
  79. end;
  80.  
  81. procedure TForm1.FormCreate(Sender: TObject);
  82. var
  83.   sl: TStringList;
  84. begin
  85.   SortCount('infile.txt', sl);
  86.   try
  87.     ShowListInMemo(sl, Memo1);
  88.     Memo1.Lines.SaveToFile('outfile.txt');
  89.   finally
  90.     sl.Free;
  91.   end;
  92. end;
  93.  
  94. end.

howardpc  question to you ;)

I can't develop your code for the next need.

I want to count duplicates (or multi x3,x4...) from a CSV file according to DATETIME.

As a result (memo1) I want: DateTIME; Count; all Device_ID participating in the duplicate

Format of CSV file :


user_id,device_id,"datetime"
18817,13174,2020-01-09 00:01:14
15190,10604,2020-01-09 00:09:04
15190,10604,2020-01-09 00:09:05
10892,7559,2020-01-04 10:02:21
10892,7559,2020-01-04 10:52:59
10892,7559,2020-01-04 13:56:42
10892,7559,2020-01-04 20:46:01
15190,10604,2020-01-09 00:13:48
15190,10604,2020-01-09 00:13:48
6521,4879,2020-01-09 00:14:53
Title: Re: Sorting and Counting
Post by: howardpc on July 01, 2020, 11:03:18 am
Try the attached project.
Title: Re: Sorting and Counting
Post by: mpknap on July 02, 2020, 07:01:29 am
Try the attached project.

Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link

https://github.com/credo-science/Windows-Tools/blob/master/dup.rar
Title: Re: Sorting and Counting
Post by: howardpc on July 02, 2020, 10:15:13 am
Works with your CSV file. With my (larger file) it doesn't finish.
I'm not in the least bit surprised it fails on your "larger" file. Your file is 3.4 MB!
The code I offered was written in about an hour, and completely untested except on a single file of size 377 bytes. Your post never indicated that you were trying to analyse data files of several MB.

You can't expect code to scale to encompass data a million times bigger than it has been tested on without needing adjustment.
I don't know in the code I offered if the limitation you encounter is to do with memory or a TStringlist's natural limits (e.g. the Count property is an Integer, not an int64). Nor does it really matter. TStringList is not suited for processing such large scale data. You would be better advised to use a proper database, one designed to cope with with large datasets, rather than relying on a roll-your-own database based on TStringList.

The code I offered assumes it can load all data into memory at one go. For large datasets you need some way to cache the data, and process it in manageable chunks. You can't try to swallow it whole all at once.
In other words your data requires a different algorithm, one designed with the scale of the data in mind, not one simply designed on the basis of the data format.
Title: Re: Sorting and Counting
Post by: jamie on July 02, 2020, 01:24:56 pm
of your last example is still using the TMEMO then maybe its not able to hold the list. It is of course not intended for large list.
Title: Re: Sorting and Counting
Post by: jamie on July 02, 2020, 01:27:14 pm
Try the attached project.

Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link

https://github.com/credo-science/Windows-Tools/blob/master/dup.rar

Please provide a "ZIP" file, I don't process RAR files ..

Title: Re: Sorting and Counting
Post by: rvk on July 02, 2020, 01:33:29 pm
Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link
Are you sure it doesn't finish????

If I put in a writeln(I) it does progress but with two lines a second. With 104.000 lines it will take 14 minutes to complete.
I don't think this code is the best way to find duplicates in large files.
Title: Re: Sorting and Counting
Post by: rvk on July 02, 2020, 01:51:53 pm
If you really only want to count duplicate dates, you might want to add the string to the sorted TStringList with date first. After that you can loop the list, but only check the first 19 characters of the string. DON't work with CommaText etc... it's too slow.

In that case you get something like this:
Note the String.Split(',') which is much much faster.
There is no need for a separate unique date stringlist because we already sorted the list when loading. And because date is now the first entry it is sorted on date (for easy checking). You do need to set Duplicates to dupAccept and Sorted to true then.

Code: Pascal  [Select][+][-]
  1. procedure TForm1.FormCreate(Sender: TObject);
  2. begin
  3.   FOriginalCSV := TStringList.Create;
  4.   try
  5.     FOriginalCSV.LoadFromFile('pewniacy_odstycznia_true.csv');
  6.     CollectDuplicateDates(FOriginalCSV, DuplicatesMemo);
  7.   finally
  8.     FOriginalCSV.Free;
  9.   end;
  10. end;
  11.  
  12.  
  13. procedure TForm1.CollectDuplicateDates(aCSVList: TStrings; aMemo: TMemo);
  14. var
  15.   i, dups: integer;
  16.   dups_string: String;
  17.   deviceids: TStringList;
  18.   A: array of string;
  19. begin
  20.   deviceids := TStringList.Create;
  21.   try
  22.  
  23.     deviceids.Duplicates := dupAccept;
  24.     deviceids.Sorted := True;
  25.  
  26.     // make a sorted list with DATE+TIME as first entry
  27.     for i := 0 to Pred(aCSVList.Count) do
  28.     begin
  29.       A := aCSVList[i].Split(',');
  30.       if (Length(A) > 2) then deviceids.Add(A[2] + ',' + A[0] + ',' + A[1]);
  31.     end;
  32.  
  33.     dups := 0;
  34.     dups_string := '';
  35.     for i := 1 to Pred(deviceids.Count) do
  36.     begin
  37.       // ONLY check date+time entry, first 19 characters
  38.       if copy(deviceids[i - 1], 1, 19) = copy(deviceids[i], 1, 19) then
  39.       begin
  40.         Inc(dups);
  41.         if dups = 1 then dups_string := deviceids[i - 1];
  42.         dups_string := dups_string + ' // ' + deviceids[i];
  43.       end
  44.       else
  45.       begin
  46.         A := deviceids[i].Split(',');
  47.         if dups > 0 then
  48.           aMemo.Lines.Add('"%s"  count=%d   device ids: "%s"', [A[0], dups + 1, dups_string]);
  49.         dups := 0;
  50.         dups_string := '';
  51.       end;
  52.     end;
  53.  
  54.   finally
  55.     deviceids.Free;
  56.   end;
  57. end;

Title: Re: Sorting and Counting
Post by: howardpc on July 02, 2020, 04:56:26 pm
With rvk's more intelligent algorithm it looks like you can still use TStringList to good effect on multi-MB files within a reasonable time frame.
Time spent on developing a good design before writing a line of code is  time well spent, and can lead to cleaner and faster-performing code.
I usually find it easier to adapt/improve other people's code than come up with a winner myself first time. Perhaps this is because I'm just an autodidact  hobbyist with no formal training in IT, and I tend to go for a brute force approach, when a few more minutes reflection would save me effort in the long run.
Title: Re: Sorting and Counting
Post by: lucamar on July 02, 2020, 05:26:58 pm
[..] I tend to go for a brute force approach, when a few more minutes reflection would save me effort in the long run.

You've got plenty company in that, and not only of "autodidact  hobbyist"s. Professionals in a hurry tend to go for that too (and I talk from personal experience :-[)
Title: Re: Sorting and Counting
Post by: jamie on July 02, 2020, 06:33:36 pm
@rvk, the loop is one more than it should be  8-)

Title: Re: Sorting and Counting
Post by: rvk on July 02, 2020, 06:45:08 pm
@rvk, the loop is one more than it should be  8-)
In my example or in the original in the .rar?

Of course this could be done even more efficiëntly but I just reacted at the the given code in the .rar (where there is a loop within a loop). And there a TStringList was used. So I build on that.
Title: Re: Sorting and Counting
Post by: mpknap on July 02, 2020, 09:01:49 pm
Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link
Are you sure it doesn't finish????

If I put in a writeln(I) it does progress but with two lines a second. With 104.000 lines it will take 14 minutes to complete.
I don't think this code is the best way to find duplicates in large files.

I used Writeln (i);
The program works. But it's slow.
 
Code: Pascal  [Select][+][-]
  1.  writeln (aUniqueDates.Count);
show over 90,000.

loop  :

Code: Pascal  [Select][+][-]
  1. for i := 0 to Pred(aUniqueDates.Count) do
......

 does it for over a one second for each step.

That's why it lasts so long. 90,000 x 1 second;)
Title: Re: Sorting and Counting
Post by: rvk on July 02, 2020, 09:04:38 pm
That's why it lasts so long. 90,000 x 1 second;)
Look a few posts back. I showed you code which does it in less then 5 seconds.
Title: Re: Sorting and Counting
Post by: mpknap on July 02, 2020, 09:10:20 pm
If you really only want to count duplicate dates,

Yes. I'm only interested in duplets, triplets and more ....

More precisely, it's about finding triplets and looking for a second (similar) triplet for the same DEVICE_ID in close proximity, e.g. 5 minutes.
It's complicated to describe because there are no exact directions. I run blind research ;)
Title: Re: Sorting and Counting
Post by: TRon on July 02, 2020, 09:33:26 pm
uhm... 1 sec * 90.000 ?

For giggles I fired up sqlite3:
Code: SQL  [Select][+][-]
  1. .mode csv
  2. .import pewniacy_odstycznia_true.csv dupdup
  3. .schema dupdup
  4. CREATE TABLE dupdup(
  5.   "user_id" TEXT,
  6.   "device_id" TEXT,
  7.   "datetime" TEXT
  8. );
  9. SELECT *, COUNT(datetime) AS dupes FROM dupdup GROUP BY datetime HAVING dupes > 1;
  10.  
Which produces an instant result.

It simply indicates you're using the wrong tool for the job. (and by tool, I meant classes/solution/appraoch, not Pascal as a language)
Title: Re: Sorting and Counting
Post by: mpknap on July 02, 2020, 09:33:36 pm
That's why it lasts so long. 90,000 x 1 second;)
Look a few posts back. I showed you code which does it in less then 5 seconds.

thx. really fast  :)
Title: Re: Sorting and Counting
Post by: mpknap on July 02, 2020, 09:47:46 pm
uhm... 1 sec * 90.000 ?

For giggles I fired up sqlite3:
Code: SQL  [Select][+][-]
  1. .mode csv
  2. .import pewniacy_odstycznia_true.csv dupdup
  3. .schema dupdup
  4. CREATE TABLE dupdup(
  5.   "user_id" TEXT,
  6.   "device_id" TEXT,
  7.   "datetime" TEXT
  8. );
  9. SELECT *, COUNT(datetime) AS dupes FROM dupdup GROUP BY datetime HAVING dupes > 1;
  10.  
Which produces an instant result.

It simply indicates you're using the wrong tool for the job. (and by tool, I meant classes/solution/appraoch, not Pascal as a language)

I at SQLITE (DBBrowser for Sqlite) try to do this. The problem is that it is not possible to display in the DEVICEID output column all the Devices participating in the multi event.
I need their numbers the most.

If you can do it, it will be great :)

Yes Sqlite ist faster then all ;)
Title: Re: Sorting and Counting
Post by: TRon on July 02, 2020, 09:53:08 pm
I at SQLITE (DBBrowser for Sqlite) try to do this. The problem is that it is not possible to display in the DEVICEID output column all the Devices participating in the multi event.
I need their numbers the most.
If you are seriously thinking of using the data as sql dataset in your program, then i can try set it up here at my end.

The only problem is that i do not know exactly what you mean by "it is not possible to display in the DEVICEID output column all the Devices participating in the multi event"

The output from the statement as in my previous post displays the deviceid. Do youo mean the event can happen on the exact same date-time but using another device-id ? e.g. you need a (additional) distinction between device-id's ?
Title: Re: Sorting and Counting
Post by: rvk on July 02, 2020, 10:09:24 pm
Yes Sqlite ist faster then all ;)
You can do it that fast in pascal too. Like TRon already said. It's was the approach that was wrong.

In my example I used TStringList sorted = true and added all the lines to get a sorted list. After that I did a second loop to detect the duplicates. Both steps could be pulled together to make it even faster.

But I like the SQLite approach too. You can create an SQL statement to get the duplicate devices. You can even expand it to give duplicates in 5 minutes of each other like you mentioned (although that will become a somewhat advanced sql :) ).

That's why it's important to first think of what you want, set it on paper, have a design, and then (and not sooner) go programming.
Title: Re: Sorting and Counting
Post by: TRon on July 02, 2020, 10:58:16 pm
You can do it that fast in pascal too. Like TRon already said. It's was the approach that was wrong.
In howardpc's and your defence (probably others as well, I haven't read the whole thread), initially you people had to work with the sample data (which in hindsight wasn't a good representation of the actual situation)

@mpknap
And of course there are the exceptions. If you are going to convert a decades old database that needs an upgrade, do you prefer speed over consistency ? I don't and will choose the slower more precise solution over any speedy one that might be sloppy. Time is usually of no concern in such cases (and if there is then those that put on the time-restriction can go play with themselves, as they had decades to think about their sloppy maintenance) (*)

That's why it's important to first think of what you want, set it on paper, have a design, and then (and not sooner) go programming.
Exactly.

@mpknap
Especially if it concerns a project that is actually a little over your head (for whatever reason). I usually look at the bigger picture first, write down that flow of the program and then try to split up the big(ger) chunks of the program in parts that I am still able to take on, again on paper. In case I need to use techniques (or topics) that I have never dealt with before, I can check those first in a (small) test-program before incorporating such parts into the final program. It's a balance between keeping the bigger picture in mind while at the same time working on detailed implementations. The more work you do beforehand (on paper) the easier it becomes to actually implement the code.

(*) Therefor, also in relation to what I wrote directly above, it is also important to think about things like speed beforehand. Sometimes it matters, sometimes it won't. More is not always better  :D
Title: Re: Sorting and Counting
Post by: mpknap on July 03, 2020, 07:36:42 am
 
That's why it's important to first think of what you want, set it on paper, have a design, and then (and not sooner) go programming.

I understand everything you want to advise me. And thank you.

The point is that nobody knows what we're looking for and how to find it. These are just my guesses.
This can be compared to "looking for rain drops that will fall on three flowers in a large garden at the same time. If this event occurs again within a few minutes (at least once), there is a suspicion of success."

It's about searching for cascades of cosmic rays.
I conduct research on my own, in my free time. I'm not a programmer, engineer or scientist that's why so much chaos in my questions;)

And my progress is only thanks to you and your knowledge.

Quote
The only problem is that i do not know exactly what you mean by "it is not possible to display in the DEVICEID output column all the Devices participating in the multi event"
TRON.
If the value in the DUP column is, for example, "2", we still do not know which Device_ID make up it. There would have to be an additional column in which it will show Devices numbers ... See screen in the attachment
 
Title: Re: Sorting and Counting
Post by: 440bx on July 03, 2020, 08:03:42 am
The point is that nobody knows what we're looking for and how to find it.
Just a very general comment.  When you believe there might be something to be found in the data but don't even know what, that's when SQL databases are great.  (not the only thing they are excellent for but, that's one of them.)

SQL allows you to "play" with the data with little effort.  Of course, you'll get the most of out of it by acquiring a fairly decent level of knowledge in SQL.  Fortunately, SQL is quite easy and there are a lot of forums with helpful users willing to help when you hit a brick wall.

Personally, I like Postgres but, when it comes to user support, some of the users that participate in the Oracle SQL forum are literally incredible.  Both are extremely capable DBMS systems and they'll allow you to look at data just about every which way you want to look at it, in just a few lines of SQL.

In long winded way, what I'm saying is that I probably wouldn't use Pascal for what you're doing.  I'd use something that requires less work to try random things on the fly.


Title: Re: Sorting and Counting
Post by: TRon on July 03, 2020, 08:05:02 am
The point is that nobody knows what we're looking for and how to find it. These are just my guesses.
I have no idea what your program is suppose to be doing as a final result or how this should be presented to the user, since you are the programmer you are the one in control. So, yes unless you do not know what you wish to achieve/accomplish in the end then we do not know either  ;)

Quote
This can be compared to "looking for rain drops that will fall on three flowers in a large garden at the same time. If this event occurs again within a few minutes (at least once), there is a suspicion of success."
The answer to that question is 42 btw.

Quote
If the value in the DUP column is, for example, "2", we still do not know which Device_ID make up it. There would have to be an additional column in which it will show Devices numbers ... See screen in the attachment
Yeah, and that is impossible to realise because the dupcount is/can be made up of multiple Device_ID's.

Come to think about it, why do you need a duplicate-count to begin with ? imho it isn't helpful at all to know the number of duplicates, unless you have a specific purpose for it ? (which is currently unknown to us, or at least to me).

afaik this is how you manage to create a list of unique duplicates (assuming data is the name of the SQL table):
Code: SQL  [Select][+][-]
  1. SELECT DISTINCT t1.datetime, t1.device_id FROM DATA AS t1 INNER JOIN DATA AS t2 ON t1.datetime = t2.datetime WHERE t1.device_id <> t2.device_id ORDER BY t1.datetime, CAST(t1.device_id AS INTEGER);
  2.  

And, again afaik, this creates a list of the duplicate items.
Code: SQL  [Select][+][-]
  1. SELECT t1.datetime, t1.device_id, t1.user_id FROM DATA AS t1 INNER JOIN (SELECT datetime, device_id, COUNT(*) AS dupcount FROM DATA GROUP BY datetime, device_id HAVING dupcount > 1) AS t2 ON t1.datetime = t2.datetime AND t1.device_id = t2.device_id ORDER BY t1.datetime, t1.device_id;
  2.  

Both show the device_id's that have a duplicate datetime field.

edit: and that picture ... is exactly the kind of cascading that makes those rays act in chaos.... you will never see those figures, at least not by the provided raindrops  ;D
Title: Re: Sorting and Counting
Post by: mpknap on July 05, 2020, 04:36:52 pm

Code: Pascal  [Select][+][-]
  1. procedure TForm1.FormCreate(Sender: TObject);
  2. begin
  3.   FOriginalCSV := TStringList.Create;
  4.   try
  5.     FOriginalCSV.LoadFromFile('pewniacy_odstycznia_true.csv');
  6.     CollectDuplicateDates(FOriginalCSV, DuplicatesMemo);
  7.   finally
  8.     FOriginalCSV.Free;
  9.   end;
  10. end;
  11.  
  12.  
  13. procedure TForm1.CollectDuplicateDates(aCSVList: TStrings; aMemo: TMemo);
  14. var
  15.   i, dups: integer;
  16.   dups_string: String;
  17.   deviceids: TStringList;
  18.   A: array of string;
  19. begin
  20.   deviceids := TStringList.Create;
  21.   try
  22.  
  23.     deviceids.Duplicates := dupAccept;
  24.     deviceids.Sorted := True;
  25.  
  26.     // make a sorted list with DATE+TIME as first entry
  27.     for i := 0 to Pred(aCSVList.Count) do
  28.     begin
  29.       A := aCSVList[i].Split(',');
  30.       if (Length(A) > 2) then deviceids.Add(A[2] + ',' + A[0] + ',' + A[1]);
  31.     end;
  32.  
  33.     dups := 0;
  34.     dups_string := '';
  35.     for i := 1 to Pred(deviceids.Count) do
  36.     begin
  37.       // ONLY check date+time entry, first 19 characters
  38.       if copy(deviceids[i - 1], 1, 19) = copy(deviceids[i], 1, 19) then
  39.       begin
  40.         Inc(dups);
  41.         if dups = 1 then dups_string := deviceids[i - 1];
  42.         dups_string := dups_string + ' // ' + deviceids[i];
  43.       end
  44.       else
  45.       begin
  46.         A := deviceids[i].Split(',');
  47.         if dups > 0 then
  48.           aMemo.Lines.Add('"%s"  count=%d   device ids: "%s"', [A[0], dups + 1, dups_string]);
  49.         dups := 0;
  50.         dups_string := '';
  51.       end;
  52.     end;
  53.  
  54.   finally
  55.     deviceids.Free;
  56.   end;
  57. end;


RVK.
I need one more condition in your code.

I want them to be displayed in Memo, only records where DeviceID are not the same.
If there are 3 dup for DateTime and DeviceID are 3 times the same then we reject it.
I try to do it myself but it doesn't work out.

This will bring me closer to finding "raindrops";)
Title: Re: Sorting and Counting
Post by: rvk on July 05, 2020, 05:48:42 pm
I need one more condition in your code.

I want them to be displayed in Memo, only records where DeviceID are not the same.
If there are 3 dup for DateTime and DeviceID are 3 times the same then we reject it.
I try to do it myself but it doesn't work out.

This will bring me closer to finding "raindrops";)
Is you user-id always the same as device-id on the same date+time?
In that case my initial thought was correct and you can just match the entire line during sorting. Set duplicates to dupIgnore and same user,device,datetimes are ignored.

So this should be sufficient
Code: Pascal  [Select][+][-]
  1. deviceids.Duplicates := dupIgnore;
Title: Re: Sorting and Counting
Post by: mpknap on July 05, 2020, 07:06:01 pm

Is you user-id always the same as device-id on the same date+time?
In that case my initial thought was correct and you can just match the entire line during sorting. Set duplicates to dupIgnore and same user,device,datetimes are ignored.
 

Yes . UserID and DeciceID are the same and repeat themselves.

UserID is the user number.
DeviceID is the smartphone number for the user. Users can have multiple devices.

A flower in the garden is just a smartphone.
And the raindrop is the detection of cosmic radiation in the phone.

All data comes from the CREDO IFJ Poland project. :)

I am looking for whether two phones will catch radiation detection in the same second, and whether they will repeat themselves in a short interval of time for the same devices. If they are found, there is suspicion of "Air Shower"

https://en.wikipedia.org/wiki/Air_shower_(physics)
 8)


Quote
So this should be sufficient
Code: Pascal  [Select][+][-]
  1. deviceids.Duplicates := dupIgnore;

Its work :)
Title: Re: Sorting and Counting
Post by: TRon on July 05, 2020, 08:39:40 pm
Its work :)
So does
Code: SQL  [Select][+][-]
  1. SELECT rowid, datetime, device_id, user_id, COUNT(datetime) AS dupes, GROUP_CONCAT(DISTINCT device_id || ' (' || user_id || ')' ) AS dup_ids FROM DATA GROUP BY datetime HAVING dupes > 1 AND instr(dup_ids, ',') > 0;
It still doesn't mean that picture of yours is reproducible or representable for your data ...  ;)
Title: Re: Sorting and Counting
Post by: mpknap on July 05, 2020, 08:51:06 pm
Its work :)
So does
Code: SQL  [Select][+][-]
  1. SELECT rowid, datetime, device_id, user_id, COUNT(datetime) AS dupes, GROUP_CONCAT(DISTINCT device_id || ' (' || user_id || ')' ) AS dup_ids FROM DATA GROUP BY datetime HAVING dupes > 1 AND instr(dup_ids, ',') > 0;
It still doesn't mean that picture of yours is reproducible or representable for your data ...  ;)

You are genius!!!! You don't even know how much I was looking for! Revelation :)
Thx!!!
Title: Re: Sorting and Counting
Post by: TRon on July 05, 2020, 09:47:38 pm
You are genius!!!!

Albert Einstein was a genius, so was Stephen Hawking. I even consider those that work on compilers such as Free Pascal, or an IDE as Lazurus genius. I am merely someone ploughing my way through boring documentation and attempt to apply what I've just read  :)

The sad part about it really is that I've been sitting on that since I have edited my reply at #197 but, wasn't able to share because you were still at an intermediate step/position in your quest for an answer (one wrong turn heading towards your destination w/could have rendered that SQL line completely useless, in which case I would have had to read even more boring documentation :D ).

Quote
You don't even know how much I was looking for! Revelation :)
I'm pleased to learn that it is useful for you.

I wish you much happy raindrops and balanced cosmic rays !
Title: Re: Sorting and Counting
Post by: mpknap on July 05, 2020, 10:47:52 pm
 
[/quote]

Albert Einstein was a genius, so was Stephen Hawking.
[/quote]
iam Bob. Bob the Builder ;) Simple worker. I love Pascal :)
Title: Re: Sorting and Counting
Post by: mpknap on July 11, 2020, 08:27:32 pm
Its work :)
So does
Code: SQL  [Select][+][-]
  1. SELECT rowid, datetime, device_id, user_id, COUNT(datetime) AS dupes, GROUP_CONCAT(DISTINCT device_id || ' (' || user_id || ')' ) AS dup_ids FROM DATA GROUP BY datetime HAVING dupes > 1 AND instr(dup_ids, ',') > 0;
It still doesn't mean that picture of yours is reproducible or representable for your data ...  ;)

Welcome back. ;)
Ultimately, I'm using the code in this form:
Code: MySQL  [Select][+][-]
  1. [code=mysql]SELECT  datetime(timestamp/1000,'unixepoch') as czas,  
  2. COUNT(datetime(timestamp/1000,'unixepoch')) AS dupes,
  3. GROUP_CONCAT( DISTINCT device_id  ) AS dup_ids
  4. FROM detections
  5.  
  6. GROUP BY datetime(timestamp/1000,'unixepoch') HAVING dupes > 2
  7.  
  8.         AND instr(dup_ids, ',') >0
[/code]
It works great, but I thought about something.
You can make the condition that only those DUP_IDS are displayed where the number of "," is greater than 3 (comma).

Because in this way I can have a triplet shown but different user_ID. It shows your code but also shows most triplets where there really are 2 users which is also correct.

In the JPG attachment with explanation ;)


Title: Re: Sorting and Counting
Post by: TRon on July 11, 2020, 09:38:13 pm
You can make the condition that only those DUP_IDS are displayed where the number of "," is greater than 3 (comma).
Yes that is possible to realise, just not very reliable (it involves deleting characters from the original string and comparing the length of both strings in order to determine how many comma's there are).

However, this is starting to turn into a string manipulation contest. At least SQLite was not designed to do such things (in an easy manner). Other SQL databases perhaps might though.

Have you considered creating your own custom function(s) using Pascal ? see also: http://www.sqlite.org/c3ref/create_function.html as unfortunately SQLite does not seem to support the statement "create function".

Quote
In the JPG attachment with explanation ;)
Just for the record. In the dataset you shared with us, there is no such data. In that selection, there doesn't seem to exist any data that matches the criteria with having more than 3 3 or more distinct ID's.

edit: stupid typo.
Title: Re: Sorting and Counting
Post by: rvk on July 11, 2020, 10:25:33 pm
You can make the condition that only those DUP_IDS are displayed where the number of "," is greater than 3 (comma).
Yes that is possible to realise, just not very reliable (it involves deleting characters from the original string and comparing the length of both strings in order to determine how many comma's there are).
Something with having count(DISTINCT device_id) > 2 or likewise???

So (but I can't test this)
Code: SQL  [Select][+][-]
  1. SELECT  datetime(TIMESTAMP/1000,'unixepoch') AS czas,  
  2. COUNT(datetime(TIMESTAMP/1000,'unixepoch')) AS dupes,
  3. GROUP_CONCAT( DISTINCT device_id  ) AS dup_ids
  4. FROM detections
  5. GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
  6. HAVING dupes > 2 AND COUNT(DISTINCT device_id) > 2

I'm not sure you even need the dupes column then??
Code: SQL  [Select][+][-]
  1. SELECT  datetime(TIMESTAMP/1000,'unixepoch') AS czas,  
  2. GROUP_CONCAT( DISTINCT device_id  ) AS dup_ids
  3. FROM detections
  4. GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
  5. HAVING COUNT(DISTINCT device_id) > 2
??
Title: Re: Sorting and Counting
Post by: TRon on July 11, 2020, 11:16:04 pm
Something with having count(DISTINCT device_id) > 2 or likewise???
Interesting.

I didn't know you where allowed to do that, thank you rvk.

Seems to work like a charm.

So, your first statement does the job, the second one (and I agree that it is tempting to want to try) unfortunately seem to include more results then the original statement we started out with. I haven't been able to determine which data exactly it concerns (so unable to tell why).

Other than that I am also unable to test it further as the provided dataset does not contain any data matching the criteria.
Title: Re: Sorting and Counting
Post by: TRon on July 12, 2020, 12:17:30 am
...the second one (and I agree that it is tempting to want to try) unfortunately seem to include more results then the original statement we started out with. I haven't been able to determine which data exactly it concerns (so unable to tell why).
Oh, wait... seems I made an error in my verification SQL statement there.  :-[ Sorry about that.

Yes, your second solution @rvk seems to work also.

@mpknap: as stated before: Change the objective and you can start redesigning your statement(s)  ;)
Title: Re: Sorting and Counting
Post by: mpknap on July 12, 2020, 12:04:50 pm
 
Quote

Code: SQL  [Select][+][-]
  1. SELECT  datetime(TIMESTAMP/1000,'unixepoch') AS czas,  
  2. COUNT(datetime(TIMESTAMP/1000,'unixepoch')) AS dupes,
  3. GROUP_CONCAT( DISTINCT device_id  ) AS dup_ids
  4. FROM detections
  5. GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
  6. HAVING dupes > 2 AND COUNT(DISTINCT device_id) > 2

I'm not sure you even need the dupes column then??
Code: SQL  [Select][+][-]
  1. SELECT  datetime(TIMESTAMP/1000,'unixepoch') AS czas,  
  2. GROUP_CONCAT( DISTINCT device_id  ) AS dup_ids
  3. FROM detections
  4. GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
  5. HAVING COUNT(DISTINCT device_id) > 2
??


Yes! Both codes give the same and correct results. Now the results are clear and transparent .Thanks :) 

Quote
Other than that I am also unable to test it further as the provided dataset does not contain any data matching the criteria.

Quote
@mpknap: as stated before: Change the objective and you can start redesigning your statement(s)  ;)

Yes, I know, but unfortunately the original SQLITE file is 4.5GB.


Quote
Have you considered creating your own custom function(s) using Pascal ? see also: http://www.sqlite.org/c3ref/create_function.html as unfortunately SQLite does not seem to support the statement "create function".

Interesting, but possible in Pascal?
Title: Re: Sorting and Counting
Post by: Thaddy on July 12, 2020, 12:31:14 pm
Interesting, but possible in Pascal?
Of course. I use that all the time for my special needs... Mind the cdecl for your external libraries, though.
Title: Re: Sorting and Counting
Post by: rvk on July 12, 2020, 12:37:53 pm
Quote
@mpknap: as stated before: Change the objective and you can start redesigning your statement(s)  ;)
Yes, I know, but unfortunately the original SQLITE file is 4.5GB.
Yikes. That's more than the 104.000 lines you gave before. That illustrates the point extra. You should have mentioned that at the beginning. A simple one TStringList solution with sorting in memory isn't really feasible in that case and we would have suggested a DB solution from the beginning.
Title: Re: Sorting and Counting
Post by: TRon on July 12, 2020, 07:07:11 pm
Interesting, but possible in Pascal?
As Thaddy already wrote, yes

For an example see FreePascal package fcl-db/examples/sqlite3extdemo.pp (and accompanied myext.pp)

Yes, I know, but unfortunately the original SQLITE file is 4.5GB.
Ah, the final requirements/conditions. It took only #210 posts  ;)

As rvk already wrote, a vital piece if information that should have been mentioned from the start imho. Even in case you are not able to share (all) the data. It just makes it a bit more difficult to verify (at least in my case, since i'm fairly new to SQLite).
TinyPortal © 2005-2018