Possible approach - ordered associative array with unique keys. FGL is standard with TFPGMap in it.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
Why do I think this is homework?
In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.
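A minimal console sketch of that approach — a sorted TStringList with Objects[] repurposed as counters. The sample values are illustrative, not from the actual input file:

```pascal
program CountWithList;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // illustrative input; in practice these lines would come from the file
  Data: array[0..4] of string = ('7', '3', '7', '7', '3');

var
  sl: TStringList;
  i, idx: Integer;

begin
  sl := TStringList.Create;
  try
    sl.Sorted := True;   // keeps IndexOf at O(log n) per lookup
    for i := 0 to High(Data) do
    begin
      idx := sl.IndexOf(Data[i]);
      if idx < 0 then
        sl.AddObject(Data[i], TObject(PtrUInt(1)))                // first sighting
      else
        sl.Objects[idx] := TObject(PtrUInt(sl.Objects[idx]) + 1); // bump counter
    end;
    for i := 0 to sl.Count - 1 do
      WriteLn(sl[i], ' - ', PtrUInt(sl.Objects[i]));              // sorted output
  finally
    sl.Free;
  end;
end.
```

Casting a PtrUInt counter into the Objects[] slot avoids allocating a counter object per unique value, which is exactly the trick howardpc's full solution later in the thread uses.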
gmap might depend on the number of unique values. If that gets large it might also slow down.

About ~10000000 random longint values.
Unfortunately, it's not only a matter of resources: on a dataset of the size specified above, your solution is 4 times (90 s) slower than engkin's one (22 s), and it is prohibitively slow for a task of this size.

I have changed my mind.
A generic container like TStringList is not well suited to this task on large files.
The TXT file is large, sorting takes more than an hour.
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user, only for me. It is important that the result is correct.

Can you share the input file?
In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobbyist.

Check also this - https://en.wikipedia.org/wiki/External_sorting.
I slightly changed Akira1364's program:
Yeah, LGenerics is really good...

Nice to hear kind words from you. :)
Did you use Akira's InputGenerator program to generate the input file?
I imagine he used the original "20MB" code of his from an earlier comment for that one.

I will use that. Thank you Akira.
What do you get if you change the ...
Also FWIW, I'm quite sure that for Mangakissa's version, it's this part ...

It also seems that the array search has quadratic complexity.
@440bx
Excellent! Thank you Avk.
For current version Akira's InputGenerator:
Can this be done with {$mode objfpc}?

Obviously, yes! See Howard's comments above.
@julkas, done

I appreciate that, thank you. :)
@440bx:
So, do you offer to compete with the Windows file mapping and ntdll? Okay, why not. :)
To run your SortCount (from SortCount2) inside Howard's benchmark, I put the SortCount2 code in a separate unit:
The differences are not large, but they are consistent. When executed many times over, your implementation performs better. One possible reason that comes to mind is that qsort has to call the compare function "many times" and, I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.) I thought that not having to do string to integer conversions would compensate for the overhead of passing parameters to the sort function; apparently that is not enough.
repeatMillionsCount = 10
  Avk2's time:  3.0730  #unique: 799999  #total: 10000000
  440bx's time: 3.1670  #unique: 799999  #total: 10000000
repeatMillionsCount = 12
  Avk2's time:  3.6350  #unique: 799999  #total: 12000000
  440bx's time: 3.7600  #unique: 799999  #total: 12000000
repeatMillionsCount = 14
  Avk2's time:  4.1970  #unique: 800000  #total: 14000000
  440bx's time: 4.3360  #unique: 800000  #total: 14000000
repeatMillionsCount = 16
  Avk2's time:  4.7580  #unique: 800000  #total: 16000000
  440bx's time: 4.9300  #unique: 800000  #total: 16000000
repeatMillionsCount = 18
  Avk2's time:  5.3040  #unique: 800000  #total: 18000000
  440bx's time: 5.5220  #unique: 800000  #total: 18000000
repeatMillionsCount = 20
  Avk2's time:  5.9130  #unique: 800000  #total: 20000000
  440bx's time: 6.0840  #unique: 800000  #total: 20000000
A typical comparison looks like this (I'm on Linux here, so have sadly omitted 440bx)

One thing is for sure, I cannot compete on portability <chuckle>
I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.)
As a (happy!) user of LGenerics, I can tell you: it doesn't. As it's TGOrdinalArrayHelper, it uses a built-in sort specifically designed for, well, ordinals (or at least, anything that can be directly compared to them without need of an explicit comparison function, which could possibly also be achieved via operator overloading in the case of more complex types.)

Thank you for clearing that up. As you pointed out, the presence of Ordinal in the helper name is a good hint that the helper is designed and optimized for ordinal types. I suspected it but, without looking at the actual code, I didn't want to draw any conclusions based on just the name.
For anything that doesn't fit that description it just won't compile if specialized with them (failing explicitly on the lines that attempt to do the comparisons, obviously.)

That makes perfect sense.
Just to see what might be done without using generics, I tried replacing the sort in Avk2 with a HeapSort and then a QuickSort (in howardpc's OccurenceCounter). The HeapSort version was about twice as slow as howardpc's, while the QuickSort version edged out that code. I did not directly compare it to Avk2, but that code smokes.
Very interesting!
Fascinating thread.

Yes, it is.
TGOrdinalArrayHelper itself does also have a QuickSort, BTW. Also IntroSort and "DualPivotQuickSort."
I would like to get in on this adventure myself :D

Go on. You are welcome. ;)
However, you guys wouldn't like my version because it would involve using Assembler code. >:D
Benchmark is compiled with a 32-bit compiler and runs on 64-bit Windows 7.

Thank you.
@ASerge, I'm not sure about rtl-generics, but LGenerics is definitely incompatible with FPC 3.0.4.

OK. What about project with FPC 3.3.1?
OK. What about project with FPC 3.3.1?

And what's wrong with the project for 3.3.1?
@440bx
But only for 64-bit. In 32-bit, you win :). The cost of pushing and popping parameters is just too high.
64-bit version, it seems you won :):
And what's wrong with the project for 3.3.1?

Attach it, please.
Also generics.collections is way faster and more modern ...

Is it true?
Usually, yes, but mind the remarks I made in the other thread: it is about some internals. In this thread somebody used those low-level adjustments. See if you can spot who did... :-X

In this case I don't want low-level tricks, fast I/O, ... I want a short, clean and as-fast-as-possible solution with well-known data structures (classes) and algorithms. So I try to understand the pros and cons of the different Pascal generics implementations.
The uses clause is usually a dead giveaway.... Analyze the inheritance...
I want short, clean and fast as possible solution with well known data structures (classes) and algorithms.

You can legitimately say that about the algorithms you used to solve the problem; I _cannot_ say that about the implementations I presented, I traded cleanliness for speed.
In real life I can't even use fcl-stl. 80% of my Pascal code (data structures, algorithms) is written from scratch.
At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest.

+1
Replacing the Format() in my original implementation with a simple Writeln() gives a slight speed increase.

At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest. Its downside is that, in some cases, it can take a lot more memory than desirable.
That's why the results seem strange to me, because Howard's algorithm is O(n), and the algorithms with sorts are O(n*lg(n)). But it is impossible to check, @avk did not provide the project.

They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.
They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.

But for large n it's better. And the algorithm can be improved. Since a lot of memory is still allocated and we know the data format, it is better to set the minimum and maximum as constants.
Yes, provided that the range is reasonably close to n. As the ratio of range/n increases, a radix sort suffers.
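To make the range/n trade-off concrete, here is a hedged counting sketch (the bounds and sample values are illustrative): the bucket array is sized by the value range, not by the number of inputs, so a wide range costs memory even when n is small.

```pascal
program CountByRange;
{$mode objfpc}

const
  MinVal = 0;
  MaxVal = 999;  // illustrative bounds; memory = (MaxVal - MinVal + 1) counters

var
  Counts: array[MinVal..MaxVal] of Integer;          // zero-initialized (global)
  Data: array[0..5] of Integer = (5, 2, 5, 999, 2, 5);
  i, v: Integer;

begin
  // one pass to count: O(n)
  for i := 0 to High(Data) do
    Inc(Counts[Data[i]]);
  // one pass over the RANGE to emit: distinct values come out sorted for free
  for v := MinVal to MaxVal do
    if Counts[v] > 0 then
      WriteLn(v, ' - ', Counts[v]);
end.
```

With longint values spanning the full 32-bit range, the same idea would need billions of counters, which is exactly why the ratio of range to n matters.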
And the algorithm can be improved. Since memory is still allocated a lot and we know the data format, it is better to set the minimum and maximum as constants.

True, the concern is, once the algorithm uses knowledge about the data format it didn't determine itself, the
I replaced fcl-stl TVector with generics.collections TList in my algo. I don't know why TList gives very poor performance.

Just curious how much performance has degraded?
... I believe Avk2 uses an Introsort.

All LGArrayHelpers sorting algorithms are hybrid, in particular TGOrdinalArrayHelper.Sort tries to use
... But it is impossible to check, @avk did not provide the project.

Let me guess: the project you keep mentioning is LGenerics? If so, see attachment.
...So my fcl-stl TVector solution is better than Akira's generics.collections TDictionary...

This is highly dependent on the input data.
A curious fact: all coincide, except SortCount440bx.

That's because of the compare function. The number of instances of each number is correct, but the collating sequence is different from what is obtained with a numerical comparison.
Hello @avk, @hnb! (See - https://forum.lazarus.freepascal.org/index.php/topic,46254.0.html)
At least on my machine, the performance of your code is second only to 440bx one.

My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)
ETA:
Bruno's implementation can be made a smidgen faster by changing the compare function to test for equality first (since there are more duplicate values than unique), that would lower the number of comparisons required to determine relative magnitudes.
Result := 1 - integer( a=b ) - 2*integer( a<b );
I am surprised that is faster because, in order to calculate the result, the arithmetic expression, unlike a Boolean expression, must be fully evaluated, which means that in all cases two (2) compares, instead of possibly just one, will be necessary.

Result := 1 - integer( a=b ) - 2*integer( a<b );
At least on my system (Core i7 6700K) the latter runs at roughly triple the speed of the former.
MathMan
My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)

And this cannot be fixed?
And this can not be fixed?

It sure can, and Bruno and yourself are using the "fix" I'd have to use, which is doing the string to integer conversion without having readln do it for you (which is slow).
...Both of your algorithms can be made even faster by not using writeln. I suppose there must be an object (I'm guessing, a TMemoryStream) that would allow writing an entire block of memory (properly formatted beforehand) in one shot instead of a gazillion writeln(s)...

IMO what is already there is already too much, all these things move the code farther and farther from correctness, simplicity and portability. But curious.
...But why not

Result := 1 - integer( a=b ) - 2*integer( a<b );
...
IMO what is already there is already too much, all this things move the code farther and farther from correctness, simplicity and portability. But curious.

I completely agree with that. I admit to being curious too and there are a number of optimizations that come to mind, but it really feels they are completely out of place for what should be (and can be) a very simple program.
...But why not Result := 1 - integer( a=b ) - 2*integer( a<b ); ?
Result := Integer(a > b) - Integer(a < b);
Why not add logarithms to increase the effect of obfuscation. ;D

Result := 1 - integer( a=b ) - 2*integer( a<b );
In earnest: If only the sign of the result of the compare function is evaluated by the sort, wouldn't it be sufficient to just subtract the values?

You've just shown a bit of code that, once seen, seems totally obvious and makes one wonder why that isn't the way everyone does it.
function ComparePairs(constref L, R: TIntPair): LongInt;
begin
  Result := L.Key - R.Key;
end;
Makes me wonder if there is a reason, other than simply not thinking about it, why it isn't normally done that way. I cannot think of one.

I think it is normally done that way.
I think it is normally done that way.

I've read a lot of code in various languages and I think it's the first time I see it done that way, because I'd remember it. Now that I've seen it, I'm not about to forget it.
I've seen that very code (or something almost identical) both in this forum (I believe it was in code from Marco) and in the FPC sources.
But yes, if only the sign is required then simple subtraction should be sufficient.

For a sort compare function only the sign should matter (provided the sort function doesn't compare against hard-coded values, -1, 0, 1, which it definitely shouldn't.)
But what happens if, for example, L.Key = 1500000000 and R.Key = -1500000005?

Yes, you are right. Those values would cause an overflow which would incorrectly indicate that L is less than R.
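The overflow, and the sign-based fix shown earlier in the thread, can be demonstrated with a small sketch (program and function names are mine, the values are the ones from the question above):

```pascal
program CompareOverflow;
{$mode objfpc}{$Q-}  // overflow checks off, so the bad compare wraps silently

// subtraction-based compare: wrong whenever a - b does not fit in a LongInt
function CompareSub(a, b: LongInt): LongInt;
begin
  Result := a - b;
end;

// sign-based compare: always -1, 0 or 1, no overflow possible
function CompareSafe(a, b: LongInt): LongInt;
begin
  Result := Ord(a > b) - Ord(a < b);
end;

begin
  // 1500000000 - (-1500000005) = 3000000005, which wraps to a negative
  // LongInt, so CompareSub wrongly reports "L less than R"
  WriteLn(CompareSub(1500000000, -1500000005));   // negative: wrong sign
  WriteLn(CompareSafe(1500000000, -1500000005));  // 1: correct
  WriteLn(CompareSafe(3, 3));                     // 0: equal
end.
```

The subtraction trick is only safe when the key range is known to be narrow enough that the difference always fits, e.g. keys already constrained to 0..MaxInt div 2.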
Ugg, looks too much like BASIC.....

BASIC rocks.
>:(
but it works :)

Just out of curiosity, it would be interesting to see how the performance of that Python implementation compares with the various Pascal implementations.
Python's sort implementation is excellent. It's based on the Timsort algorithm.
You only have to wish, sir. :)

Thank you very much Avk. :)
Can you compare only the sorting phase time? (Python's default I/O is slow.)

Sort phase only:
Python: Time elapsed: 1.1443881560007867, #unique: 999955, #total: 10000000
Pascal: Time elapsed: 0.1750, #unique: 999955, #total: 10000000
@avk Thanks.

Julkas, ideone lags behind in compiler version for FPC and that can make a big difference. You should run the complete test suite on ideone to get any meaningful comparison.
https://ideone.com/9n168K
We can't compare Python (an interpreted language) with Pascal.

Yes, we can, since Python relies so heavily on library code compiled in other languages. Python is just glue. Much more so than pure scripting.
PYTHON rocks.
(You can also use FPC to write and add Python libraries)
And indeed:
I use Python and I like Python... :P :o
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use fewer resources.
unit mainSortCount;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, StdCtrls;

type
  TForm1 = class(TForm)
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
  end;

var
  Form1: TForm1;

procedure SortCount(const anInFile: String; out aList: TStringList);
procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);

implementation

{$R *.lfm}

{ TForm1 }

procedure SortCount(const anInFile: String; out aList: TStringList);
const
  one = PtrUInt(1);
var
  textf: TextFile;
  s: String;
  idx: Integer;

  function GetSuccObj(anIntObj: TObject): TObject;
  var
    i: PtrUInt absolute anIntObj;
  begin
    Inc(i);
    Exit(anIntObj);
  end;

begin
  Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  aList := TStringList.Create;
  aList.Duplicates := dupError;
  aList.Sorted := True;
  AssignFile(textf, anInFile);
  try
    Reset(textf);
    while not EOF(textf) do
      begin
        ReadLn(textf, s);
        s := Trim(s);
        idx := aList.IndexOf(s);
        case idx of
          -1: aList.AddObject(s, TObject(one));
        else
          aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
        end;
      end;
  finally
    CloseFile(textf);
  end;
end;

procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
var
  i: Integer;
begin
  if Assigned(aList) and Assigned(aMemo) then
    begin
      aMemo.Clear;
      for i := 0 to aList.Count-1 do
        aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
    end;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  sl: TStringList;
begin
  SortCount('infile.txt', sl);
  try
    ShowListInMemo(sl, Memo1);
    Memo1.Lines.SaveToFile('outfile.txt');
  finally
    sl.Free;
  end;
end;

end.
After sorting, I don't know how to refer to the other variables in the record, but I don't want to ask for more :)

Question for you: do you want your program to run on multiple platforms (e.g., Linux, Windows, other) or is Windows-only acceptable?
... I have a problem installing the LG package ....
ETA: if Windows-only is acceptable, then another question: are the records in the file fixed length or variable length?
What kind of install? Just make sure it is in your path.

I will try as you write. I installed the Lgenerics.LPK file, there was an error and I gave up.
Just to make sure I understand the structure of your file and its records. It looks like every record in the file consists of seven (7) fields (one per line) and that, sometimes, a field may be blank. I want to confirm that every record is always 7 fields (though one or more, except the unixtime, may be blank), is this correct ?
1570485826087 - UnixTime
0,0 - lat
0,0 - lon
Patrycja Macała - user name
16584 - user number
ekonomik1 - team name
1581 - team number
(empty line as separator)
1570485826087
0,0
0,0
Gargastwór
17943
kosmos
1243
<many more>
1567365497795
0,0
0,0
J.C.K.
6037
1
1567367488139
0,0
0,0
J.C.K.
6037
1
I am interested in obtaining the following information:

Mostly with the help of the other contributors in this thread, since my implementation doesn't even use the appropriate collating sequence.
- how many detections are in the same second / minute / hour. (I already got it thanks to your help)
- If there are several in the same second / minute / hour, display which users and their other data, such as number, coordinates, team name etc.

Piece of pie. I got several things to do today so I won't commit to a solid timeframe, but I'll give you something soon.
Thanks .
I don't want you to think that I'm using you; what I do is on my own, for my own curiosity.

Don't worry about that.
Detections in smartphones are the impact of cosmic ray particles on the phone's camera (photons, electrons, muons).

Experiments like that have the potential to yield surprising results. Many scientific discoveries resulted from unexpected side effects. One that comes to mind is the microwave oven: the scientists were constantly having headaches, and that's how they figured out that the microwaves were frying their brains (not kidding... though saying "frying" is a bit of an exaggeration.)
<snip>
Scientists are also looking for links between detection frequencies and other phenomena,
@mpknap, did you mean something like that?

That's exactly how I would see it. I have to test. I want to transfer this data to Google Maps.
In fact, everything is not as good as you say.

But you proved again that errare humanum est. ;)
...In order to use it (FPC 3.3.1 and higher and Lazarus 1.9.0 and higher)...

If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.
If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.

If you could, it would be nice. This will be open code, and I don't want to oblige other interested parties to install packages. It's supposed to be the easiest, not necessarily fast.
Ok, basically we only need sorting and binary search algorithms.

Just in case you may be interested, and in addition to what Thaddy mentioned above: under Windows, ntdll provides a typical bsearch function to search a sorted sequence.
A fairly good sorting algorithm is available in fcl-stl. But I had to write BinarySearch from scratch,
I hope that it will work correctly. Let me know if anything goes wrong.
AVK, I'm sorry but I still have a problem with starting sort_count.lpk. It refers to LGenerics and can't find RTTI. I even installed the latest version of Lazarus 2.0.6 with FPC 3.0.4 for Windows 10/64.

LGenerics needs 3.2.0 or trunk, not 3.0.4. avk already wrote that.
Thank you anyway.
@mpknap, please test this version.

Back-porting just before a major release? :D :D :D :D
However, there is a significant point. In our case, we need a binary search that, in the case of duplicate values, returns the position of the leftmost one.

Yes, that is definitely an important difference in this case. In such a case, when using bsearch, the resulting index when a match is found has to be "manually adjusted" to ensure it is the index of the first instance match.
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?

It normally wouldn't be O(N); it would be O(lg N) + (avg_dups_per_key/2).
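For reference, the leftmost position can also be found directly, without scanning back from an arbitrary match: a lower-bound style binary search stays O(log N) regardless of how many duplicates there are. A minimal sketch (names are mine, not from avk's code):

```pascal
program LeftmostBSearch;
{$mode objfpc}

// Returns the index of the LEFTMOST occurrence of aValue in the sorted
// array A, or -1 if it is absent. Always O(log N), duplicates or not.
function BisectLeft(const A: array of LongInt; aValue: LongInt): SizeInt;
var
  Lo, Hi, Mid: SizeInt;
begin
  Lo := 0;
  Hi := Length(A);               // half-open search interval [Lo, Hi)
  while Lo < Hi do
  begin
    Mid := Lo + (Hi - Lo) div 2;
    if A[Mid] < aValue then
      Lo := Mid + 1              // leftmost match lies to the right of Mid
    else
      Hi := Mid;                 // Mid could still be the leftmost match
  end;
  if (Lo < Length(A)) and (A[Lo] = aValue) then
    Result := Lo
  else
    Result := -1;
end;

begin
  //                  index: 0  1  2  3  4  5
  WriteLn(BisectLeft([1, 3, 3, 3, 7, 9], 3)); // 1: first of the three 3s
  WriteLn(BisectLeft([1, 3, 3, 3, 7, 9], 5)); // -1: not present
end.
```

The trick is that on equality the search keeps going left (Hi := Mid) instead of returning, so the loop converges on the first occurrence.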
...In the worst case, for 1 element duplicated N times, then it would be O(N).

Yes it is, and for N/2, N/4, ... duplicated elements.
Yes, but it depends on how the duplicates are distributed. For instance, consider N elements where all the duplicates are clustered in, just for example, 4 elements. Accessing any of those 4 elements is basically O(n), while accessing any other element is O(lg N). For such a distribution (granted, it is an unusual one), the big O of the totality is neither O(n) nor O(lg N); for this example it would be (2n + (n - 4) * lg n).
@mpknap, please test this version.

Yes Yes Yes!!! It works! And that was what I meant, this kind of sorting and displaying information.
Try the attached project.
Works with your CSV file. With my (larger) file it doesn't finish.

I'm not in the least bit surprised it fails on your "larger" file. Your file is 3.4 MB!
Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link
https://github.com/credo-science/Windows-Tools/blob/master/dup.rar
Works with your CSV file. With my (larger) file it doesn't finish. See the DUP.ZIP project in the link

Are you sure it doesn't finish????
[..] I tend to go for a brute force approach, when a few more minutes reflection would save me effort in the long run.
@rvk, the loop runs one iteration more than it should 8-)

In my example or in the original in the .rar?
If I put in a writeln(I) it does progress, but with two lines a second. With 104.000 lines it will take over 14 hours to complete.
I don't think this code is the best way to find duplicates in large files.
That's why it lasts so long: 90,000 x 1 second ;)

Look a few posts back. I showed you code which does it in less than 5 seconds.
If you really only want to count duplicate dates,
uhm... 1 sec * 90.000 ?
For giggles I fired up sqlite3:

.mode csv
.import pewniacy_odstycznia_true.csv dupdup
.schema dupdup
CREATE TABLE dupdup(
  "user_id" TEXT,
  "device_id" TEXT,
  "datetime" TEXT
);
SELECT *, COUNT(datetime) AS dupes
FROM dupdup
GROUP BY datetime
HAVING dupes > 1;

Which produces an instant result.
It simply indicates you're using the wrong tool for the job. (And by tool, I mean classes/solution/approach, not Pascal as a language.)
In SQLite (DB Browser for SQLite) I tried to do this. The problem is that it is not possible to display in the DEVICEID output column all the devices participating in the multi event.

If you are seriously thinking of using the data as an SQL dataset in your program, then I can try to set it up here at my end.
I need their numbers the most.
Yes, SQLite is faster than all ;)

You can do it that fast in Pascal too. Like TRon already said, it was the approach that was wrong.
You can do it that fast in pascal too. Like TRon already said. It's was the approach that was wrong.

In howardpc's and your defence (probably others as well, I haven't read the whole thread), initially you people had to work with the sample data (which in hindsight wasn't a good representation of the actual situation)
That's why it's important to first think of what you want, set it on paper, have a design, and then (and not sooner) go programming.

Exactly.
The only problem is that I do not know exactly what you mean by "it is not possible to display in the DEVICEID output column all the Devices participating in the multi event".

TRon.
The point is that nobody knows what we're looking for and how to find it.

Just a very general comment: when you believe there might be something to be found in the data but don't even know what, that's when SQL databases are great. (Not the only thing they are excellent for, but that's one of them.)
The point is that nobody knows what we're looking for and how to find it. These are just my guesses.

I have no idea what your program is supposed to be doing as a final result or how this should be presented to the user; since you are the programmer, you are the one in control. So yes, if you do not know what you wish to achieve/accomplish in the end, then we do not know either ;)
This can be compared to "looking for rain drops that will fall on three flowers in a large garden at the same time. If this event occurs again within a few minutes (at least once), there is a suspicion of success."

The answer to that question is 42 btw.
If the value in the DUP column is, for example, "2", we still do not know which Device_IDs make it up. There would have to be an additional column which shows the device numbers... See screen in the attachment

Yeah, and that is impossible to realise because the dupcount is/can be made up of multiple Device_IDs.
procedure TForm1.FormCreate(Sender: TObject);
begin
  FOriginalCSV := TStringList.Create;
  try
    FOriginalCSV.LoadFromFile('pewniacy_odstycznia_true.csv');
    CollectDuplicateDates(FOriginalCSV, DuplicatesMemo);
  finally
    FOriginalCSV.Free;
  end;
end;

procedure TForm1.CollectDuplicateDates(aCSVList: TStrings; aMemo: TMemo);
var
  i, dups: integer;
  dups_string: String;
  deviceids: TStringList;
  A: array of string;
begin
  deviceids := TStringList.Create;
  try
    deviceids.Duplicates := dupAccept;
    deviceids.Sorted := True;
    // make a sorted list with DATE+TIME as first entry
    for i := 0 to Pred(aCSVList.Count) do
    begin
      A := aCSVList[i].Split(',');
      if (Length(A) > 2) then
        deviceids.Add(A[2] + ',' + A[0] + ',' + A[1]);
    end;
    dups := 0;
    dups_string := '';
    for i := 1 to Pred(deviceids.Count) do
    begin
      // ONLY check date+time entry, first 19 characters
      if copy(deviceids[i - 1], 1, 19) = copy(deviceids[i], 1, 19) then
      begin
        Inc(dups);
        if dups = 1 then
          dups_string := deviceids[i - 1];
        dups_string := dups_string + ' // ' + deviceids[i];
      end
      else
      begin
        A := deviceids[i].Split(',');
        if dups > 0 then
          aMemo.Lines.Add('"%s" count=%d device ids: "%s"',
            [A[0], dups + 1, dups_string]);
        dups := 0;
        dups_string := '';
      end;
    end;
  finally
    deviceids.Free;
  end;
end;
I need one more condition in your code.

Is your user-id always the same as device-id on the same date+time?
I want only those records where the DeviceIDs are not the same to be displayed in the Memo.
If there are 3 dups for one DateTime and the DeviceIDs are all 3 the same, then we reject it.
I tried to do it myself but it didn't work out.
This will bring me closer to finding "raindrops";)
Is your user-id always the same as device-id on the same date+time?
In that case my initial thought was correct and you can just match the entire line during sorting. Set duplicates to dupIgnore and same user,device,datetimes are ignored.
So this should be sufficient
deviceids.Duplicates := dupIgnore;
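A tiny sketch of the dupIgnore behaviour being relied on here (the sample lines are made up): with Sorted set to True, adding an exact duplicate is silently dropped, while a line differing only in device id is kept.

```pascal
program DupIgnoreDemo;
{$mode objfpc}{$H+}

uses
  Classes;

var
  sl: TStringList;

begin
  sl := TStringList.Create;
  try
    sl.Sorted := True;
    sl.Duplicates := dupIgnore;                // duplicates silently dropped
    sl.Add('2019-10-07 22:03:46,dev1,user1');
    sl.Add('2019-10-07 22:03:46,dev1,user1'); // identical line: ignored
    sl.Add('2019-10-07 22:03:46,dev2,user2'); // different device: kept
    WriteLn(sl.Count);                        // 2
  finally
    sl.Free;
  end;
end.
```

Note that dupIgnore only takes effect while Sorted is True; on an unsorted list, Add appends unconditionally.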
It works :)

So does:

SELECT rowid, datetime, device_id, user_id,
       COUNT(datetime) AS dupes,
       GROUP_CONCAT(DISTINCT device_id || ' (' || user_id || ')') AS dup_ids
FROM DATA
GROUP BY datetime
HAVING dupes > 1 AND instr(dup_ids, ',') > 0;

It still doesn't mean that picture of yours is reproducible or representative for your data ... ;)
You are a genius!!!!
You don't even know how much I was looking for this! A revelation :)

I'm pleased to learn that it is useful for you.
You can make the condition that only those DUP_IDS are displayed where the number of "," (commas) is greater than 3.

Yes, that is possible to realise, just not very reliable (it involves deleting characters from the original string and comparing the length of both strings in order to determine how many commas there are).
In the JPG attachment with explanation ;)

Just for the record: in the dataset you shared with us, there is no such data. In that selection, there doesn't seem to exist any data that matches the criteria with having
Something with having count(DISTINCT device_id) > 2 or likewise???
Something with having count(DISTINCT device_id) > 2 or likewise???

Interesting.
...the second one (and I agree that it is tempting to want to try) unfortunately seems to include more results than the original statement we started out with. I haven't been able to determine which data exactly it concerns (so I am unable to tell why).

Oh, wait... it seems I made an error in my verification SQL statement there. :-[ Sorry about that.
SELECT datetime(TIMESTAMP/1000,'unixepoch') AS czas,
       COUNT(datetime(TIMESTAMP/1000,'unixepoch')) AS dupes,
       GROUP_CONCAT(DISTINCT device_id) AS dup_ids
FROM detections
GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
HAVING dupes > 2 AND COUNT(DISTINCT device_id) > 2
I'm not sure you even need the dupes column then????
SELECT datetime(TIMESTAMP/1000,'unixepoch') AS czas,
       GROUP_CONCAT(DISTINCT device_id) AS dup_ids
FROM detections
GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
HAVING COUNT(DISTINCT device_id) > 2
Other than that I am also unable to test it further as the provided dataset does not contain any data matching the criteria.
@mpknap: as stated before: Change the objective and you can start redesigning your statement(s) ;)
Have you considered creating your own custom function(s) using Pascal? See also: http://www.sqlite.org/c3ref/create_function.html as unfortunately SQLite does not seem to support the statement "create function".
Interesting, but possible in Pascal?

Of course. I use that all the time for my special needs... Mind the cdecl for your external libraries, though.
Yes, I know, but unfortunately the original SQLite file is 4.5 GB.

Yikes. That's more than the 104.000 lines you gave before. That illustrates the point even more. You should have mentioned that at the beginning. A simple one-TStringList solution with sorting in memory isn't really feasible in that case and we would have suggested a DB solution from the beginning.
Interesting, but possible in Pascal?

As Thaddy already wrote, yes
Yes, I know, but unfortunately the original SQLITE file is 4.5GB.

Ah, the final requirements/conditions. It only took 210 posts ;)