Recent

Author Topic: Sorting and Counting  (Read 8378 times)

mpknap

  • Jr. Member
  • **
  • Posts: 92
Sorting and Counting
« on: July 16, 2019, 07:35:20 am »
Hello.
I have a text file and in it unixtime times,
this is how it looks:

file.txt
1562551080

1562551140

1562551260

1562551260

1562551260

1562551260

1562551320

1562551380
.......


The file is very large. about 20MB.

Many lines are repeated, others are only once. The task that I want to do is to calculate how many times it is the same. So make an array with Unix time and the number of repetitions for a given time.

After sorting and calculation it should be like this:
1562551080 - 1

1562551140 - 1

1562551260 - 4

1562551320 -1

1562551380 -1
......


Any ideas?

mangakissa

  • Hero Member
  • *****
  • Posts: 947
Re: Sorting and Counting
« Reply #1 on: July 16, 2019, 08:17:48 am »
You give the solution by yourself; a two dimensional array.
Lazarus 1.84 (32b) / FPC 3.0.4
Windows 10

julkas

  • Sr. Member
  • ****
  • Posts: 424
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #2 on: July 16, 2019, 09:07:00 am »
Possible approach - ordered associative array with uniq keys.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #3 on: July 16, 2019, 09:11:17 am »
A data structure called "multiset" is well suited to your task.

mangakissa

  • Hero Member
  • *****
  • Posts: 947
Re: Sorting and Counting
« Reply #4 on: July 16, 2019, 04:20:30 pm »
Possible approach - ordered associative array with uniq keys.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
FGL is standard with TFPGMap in it.

But this is easy to do with an 2 dimensional array. After finding the doubles, you can sort in every way yo want.
Lazarus 1.84 (32b) / FPC 3.0.4
Windows 10

jamie

  • Hero Member
  • *****
  • Posts: 2140
Re: Sorting and Counting
« Reply #5 on: July 16, 2019, 10:47:22 pm »
Why do I think this is home work ?

In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.

Number 1 at blue screen app creations!

winni

  • Hero Member
  • *****
  • Posts: 549
Re: Sorting and Counting
« Reply #6 on: July 16, 2019, 11:29:05 pm »
Yes, as Jamie said:

* load it into a stringlist
* set the stringlist.sorted to true - and wait, 20 MB mlight take some time
* count the duplicates
* and now: write them with your format: ( time - count  ) into a second stringlist
* save it to disk - done

In case of Linux/Unix: this goes really faster with the shell and sort and uniq …….....

Winni

mpknap

  • Jr. Member
  • **
  • Posts: 92
Re: Sorting and Counting
« Reply #7 on: July 17, 2019, 07:00:05 am »
Why do I think this is home work ?

In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.

In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobby. In life I am Bob the Builder :). I like to write something in my spare time, probably only to prove myself. I am interested in the CREDO project, I use their data. This time I want to juxtapose them with the detection of global earthquakes, I wonder if it has any connection, although nobody writes about it ;).

Here are some tools for the CREDO project https://github.com/credo-science/Windows-Tools

And the CREDO project itself here: https://credo.science/


Yes, I started, but I do not know how to count duplicates and how to do an array with the results. I have so much code for now. Good start?

Code: Pascal  [Select]
  1. procedure TForm1.Button1Click(Sender: TObject);
  2. const
  3.   BLOCK_SIZE = 10000;
  4.  
  5. var
  6.   f: textfile;
  7.   Times: TStringList;
  8.   det, separator: string;
  9.   i: integer;
  10.  
  11. begin
  12.   czas := TStringList.Create;
  13.   czas.Duplicates := dupIgnore;
  14.   czas.Sorted := True;
  15.  
  16.   AssignFile(f, 'Base_total.txt');
  17.   reset(f);
  18.  
  19.   SetLength(det, 0);
  20.   FillChar(det, SizeOf(det), 0);
  21.   while not EOF(f) do
  22.   begin
  23.  
  24.     if i mod BLOCK_SIZE = 0 then
  25.       SetLength(det, Length(det) + BLOCK_SIZE);
  26.  
  27.     readln(f, det);
  28.     czas.Add(det);
  29.     readln(f, separator);
  30.     Inc(i);
  31.   end;
  32.   closefile(f);
  33. end;    
« Last Edit: July 17, 2019, 07:12:51 am by mpknap »

engkin

  • Hero Member
  • *****
  • Posts: 2513
Re: Sorting and Counting
« Reply #8 on: July 17, 2019, 08:47:31 am »
Maybe this:
Code: Pascal  [Select]
  1. procedure CountDuplicates(AFileName, AResultFileName: String);
  2. var
  3.   List, Res: TStringList;
  4.   Item: String = '';
  5.   Count: integer = 0;
  6.   i: Integer = 0;
  7. begin
  8.   List := TStringList.Create;
  9.   Res := TStringList.Create;
  10.   try
  11.     List.LoadFromFile(AFileName);
  12.     List.Sorted := True;
  13.     while i<List.Count do
  14.     begin
  15.       // Get current item
  16.       Item := List[i];
  17.  
  18.       // Count similar items
  19.       Count := 0;
  20.       while (i<List.Count) and (List[i]=Item) do
  21.       begin
  22.         inc(i);
  23.         inc(count);
  24.       end;
  25.  
  26.       // Add the result
  27.       Res.Add(Format('%s - %d', [Item, Count]));
  28.     end;
  29.  
  30.     Res.SaveToFile(AResultFileName);
  31.   finally
  32.     List.Free;
  33.     Res.Free;
  34.   end;
  35. end;
  36.  

To use it:
Code: Pascal  [Select]
  1.   CountDuplicates('in.txt','out.txt');

The result would be saved in out.txt file.
Notice that the first result is going to be the number of empty lines, if any.
« Last Edit: July 17, 2019, 08:51:37 am by engkin »

mangakissa

  • Hero Member
  • *****
  • Posts: 947
Re: Sorting and Counting
« Reply #9 on: July 17, 2019, 08:53:44 am »
Code: Pascal  [Select]
  1. type
  2.  
  3.   TMyTime = packed record
  4.     Unixtime : string;
  5.     Counter  : word;
  6.   end;
  7.  
  8. implementation
  9.  
  10. procedure TForm1.Button1Click(Sender: TObject);
  11. var f      : textfile;
  12.     myline : string;
  13.     MyTime : array of TMyTime;
  14.     index  : integer;
  15. begin
  16.   assignfile(f,'file.txt');
  17.   reset(f);
  18.   index := 0;
  19.   while not eof(f) do
  20.   begin
  21.     readln(f,myline);
  22.     if not find(MyTime, myline) then
  23.     begin
  24.       index := index + 1;
  25.       setlength(MyTime,index);
  26.       MyTime[index - 1].Unixtime := myLine;
  27.       MyTime[index - 1].Counter := 1;
  28.     end;
  29.   end;
  30.   closefile(f);
  31.   //use bubblesort, quicksort, heapsort or other sort
  32.   for index := 0 to length(myTime) - 1 do
  33.     memo1.Lines.add(format('time : %s   duplicates : %3d',[myTime[index].Unixtime,myTime[index].Counter]));
  34. end;
  35.  
  36. function TForm1.Find(var aMyTime : array of TMytime; const aLine : string) : boolean;
  37. var index : integer;
  38. begin
  39.   result := false;
  40.   if length(aMyTime) > 0 then
  41.   begin
  42.     for index := low(aMyTime) to high(aMyTime) do
  43.     begin
  44.       if aMyTime[index].Unixtime = aLine then
  45.       begin
  46.         aMyTime[index].Counter := aMyTime[index].Counter + 1;
  47.         result := true;
  48.         break;
  49.       end;
  50.     end;
  51.   end;
  52. end;
  53.  
  54. end.
  55.  
stringlist is nice but for small files. and uses a lot of resources for nothing. Okay, things already built in like sort, but in this way you actually see what you're doing.
Lazarus 1.84 (32b) / FPC 3.0.4
Windows 10

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #10 on: July 17, 2019, 01:32:40 pm »
Well, I'm glad to see that I'm not the only one who believes that TStringList is not quite suitable for this task (if only because unixtime is a number, and sorting numbers is much faster than sorting strings). However, the task of counting duplicate values ​​is fairly common, and the standard library should have a ready solution for it.

howardpc

  • Hero Member
  • *****
  • Posts: 3197
Re: Sorting and Counting
« Reply #11 on: July 17, 2019, 02:13:18 pm »
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use less resources.
Code: Pascal  [Select]
  1. unit mainSortCount;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.    Classes, SysUtils, Forms, StdCtrls;
  9.  
  10. type
  11.    TForm1 = class(TForm)
  12.      Memo1: TMemo;
  13.      procedure FormCreate(Sender: TObject);
  14.    end;
  15.  
  16. var
  17.    Form1: TForm1;
  18.  
  19.   procedure SortCount(const anInFile: String; out aList: TStringList);
  20.  
  21.   procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  22.  
  23. implementation
  24.  
  25. {$R *.lfm}
  26.  
  27. { TForm1 }
  28.  
  29. procedure SortCount(const anInFile: String; out aList: TStringList);
  30. const
  31.   one = PtrUInt(1);
  32. var
  33.   textf: TextFile;
  34.   s: String;
  35.   idx: Integer;
  36.  
  37.   function GetSuccObj(anIntObj: TObject): TObject;
  38.   var
  39.     i: PtrUInt absolute anIntObj;
  40.   begin
  41.     Inc(i);
  42.     Exit(anIntObj);
  43.   end;
  44.  
  45. begin
  46.   Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  47.   aList := TStringList.Create;
  48.   aList.Duplicates := dupError;
  49.   aList.Sorted := True;
  50.   AssignFile(textf, anInFile);
  51.   try
  52.     Reset(textf);
  53.     while not EOF(textf) do
  54.       begin
  55.         ReadLn(textf, s);
  56.         s := Trim(s);
  57.         idx := aList.IndexOf(s);
  58.         case idx of
  59.           -1: aList.AddObject(s, TObject(one));
  60.           else
  61.             aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
  62.         end;
  63.       end;
  64.   finally
  65.     CloseFile(textf);
  66.   end;
  67. end;
  68.  
  69. procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  70. var
  71.   i: Integer;
  72. begin
  73.   if Assigned(aList) and Assigned(aMemo) then
  74.     begin
  75.       aMemo.Clear;
  76.       for i := 0 to aList.Count-1 do
  77.         aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
  78.     end;
  79. end;
  80.  
  81. procedure TForm1.FormCreate(Sender: TObject);
  82. var
  83.   sl: TStringList;
  84. begin
  85.   SortCount('infile.txt', sl);
  86.   try
  87.     ShowListInMemo(sl, Memo1);
  88.     Memo1.Lines.SaveToFile('outfile.txt');
  89.   finally
  90.     sl.Free;
  91.   end;
  92. end;
  93.  
  94. end.

avk

  • Full Member
  • ***
  • Posts: 162
    • my self-education project
Re: Sorting and Counting
« Reply #12 on: July 17, 2019, 03:20:20 pm »
Unfortunately, it’s not only a matter of resources: on a dataset of the size specified above, your solution is 4 times(90 s.) slower than engkin's  one(22 s.)  and it is prohibitively slow for a task of this size.

julkas

  • Sr. Member
  • ****
  • Posts: 424
  • KISS principle / Lazarus 2.0.0 / FPC 3.0.4
Re: Sorting and Counting
« Reply #13 on: July 17, 2019, 05:29:50 pm »
GMap is fast enough. I will post my random test tomorrow.
procedure mulu64(a, b: QWORD; out clo, chi: QWORD); assembler;
asm
  mov rax, a
  mov rdx, b
  mul rdx
  mov [clo], rax
  mov [chi], rdx
end;

marcov

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 7575
Re: Sorting and Counting
« Reply #14 on: July 17, 2019, 05:56:51 pm »
gmap might depend on the number of unique values. If that gets large it might also slow down.