Recent

Author Topic: Sorting and Counting  (Read 15753 times)

440bx

  • Hero Member
  • *****
  • Posts: 1994
Re: Sorting and Counting
« Reply #165 on: November 14, 2019, 11:31:44 am »
However, there is a significant point. In our case, we need a binary search that, in the case of duplicate values, returns the position of the leftmost one.
Yes, that is definitely an important difference in this case. In such a case, when using bsearch, the resulting index when a match is found, has to be "manually adjusted" to ensure it is the index of the first instance match.



FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

avk

  • Sr. Member
  • ****
  • Posts: 295
    • my self-education project
Re: Sorting and Counting
« Reply #166 on: November 14, 2019, 12:24:39 pm »
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?

440bx

  • Hero Member
  • *****
  • Posts: 1994
Re: Sorting and Counting
« Reply #167 on: November 14, 2019, 01:15:19 pm »
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?
It normally wouldn't be O(n), it would be (lg N) + (avg_dups_per_key/2).

Where avg_dups is the average number of duplicates per distinct element in the table.  Obviously, if that average is large for a large number of unixtimes (in this case) then it will likely be better from a performance viewpoint to create an index of distinct keys (unixtime) and search that index instead.  IOW, as the ratio of N/distinct gets larger, performance suffers.  In the worst case, for 1 element duplicated N times, then it would be O(N).


« Last Edit: November 14, 2019, 01:30:23 pm by 440bx »
FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

avk

  • Sr. Member
  • ****
  • Posts: 295
    • my self-education project
Re: Sorting and Counting
« Reply #168 on: November 14, 2019, 02:20:05 pm »
...In the worst case, for 1 element duplicated N times, then it would be O(N).
yes it is, and for N/2, N/4, ... duplicated elements.

440bx

  • Hero Member
  • *****
  • Posts: 1994
Re: Sorting and Counting
« Reply #169 on: November 14, 2019, 03:31:21 pm »
...In the worst case, for 1 element duplicated N times, then it would be O(N).
yes it is, and for N/2, N/4, ... duplicated elements.
Yes but, it depends on how the duplicates are distributed.  For instance, consider N elements and all the duplicates are clustered in, just for example, 4 elements.  Accessing any of those 4 elements is basically O(n), while accessing any other element is O(lg N).  For such a distribution (granted, it is an unusual one), the big O of the totality is neither O(n) nor O(lg N), for this example it would be (2n + (n - 4) * lg n)

In the specific case of tracking times across the world for what may be a fairly common event, the safe approach is to create an index of distinct unixtimes and bsearch that.   No undesirable cases that way.

Anyway, your point that the start of the duplicate list isn't returned by a normal bsearch is definitely valid.  I just wanted to point out that, if the duplicates are evenly distributed across the n elements _and_ the ratio of N/duplicates isn't very large (say under 10) then, giving the traditional bsearch "a hand" to find the start of the list is reasonable and, ntdll provides one. :)




FPC v3.0.4 and Lazarus 1.8.2 on Windows 7 64bit.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #170 on: November 14, 2019, 08:37:57 pm »
@mpknap, please test this version.
Yes Yes Yes!!! It works! And that was what I meant, this kind of sorting and displaying information.

Although there is a small problem, pressing F12 to enter the form shows (attachment 1 fot1.jpg), and after attempting installation shows attachment fot2.jpg.

In any case, the algorithm is ok. At the weekend I will check it.
Thank you once again.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #171 on: November 15, 2019, 07:47:30 am »
I don't understand why you can't install VirtualtreeView, I try in different ways and every time different errors. Even in the latest version of lazarus, OnlinePackageManager does not help.

I can't edit the form without it.

AVK, you can't write it in standard Lazarus packages? ;)

If not, I give up ...

avk

  • Sr. Member
  • ****
  • Posts: 295
    • my self-education project
Re: Sorting and Counting
« Reply #172 on: November 15, 2019, 08:15:11 am »
I am glad to see some progress in your efforts, but I do not understand your failure to install VirtualTreeview. I installed VirtualTreeview from /lazarus/components/virtualtreeview. Compilation and installation are performed without any errors.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #173 on: November 16, 2019, 08:17:02 am »
finally, I was able to install. I had two other versions of Lazarus on the disk, I deleted them badly.
everything works, thank you again!

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #174 on: July 01, 2020, 07:38:51 am »
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use less resources.
Code: Pascal  [Select][+][-]
  1. unit mainSortCount;
  2.  
  3. {$mode objfpc}{$H+}
  4.  
  5. interface
  6.  
  7. uses
  8.    Classes, SysUtils, Forms, StdCtrls;
  9.  
  10. type
  11.    TForm1 = class(TForm)
  12.      Memo1: TMemo;
  13.      procedure FormCreate(Sender: TObject);
  14.    end;
  15.  
  16. var
  17.    Form1: TForm1;
  18.  
  19.   procedure SortCount(const anInFile: String; out aList: TStringList);
  20.  
  21.   procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  22.  
  23. implementation
  24.  
  25. {$R *.lfm}
  26.  
  27. { TForm1 }
  28.  
  29. procedure SortCount(const anInFile: String; out aList: TStringList);
  30. const
  31.   one = PtrUInt(1);
  32. var
  33.   textf: TextFile;
  34.   s: String;
  35.   idx: Integer;
  36.  
  37.   function GetSuccObj(anIntObj: TObject): TObject;
  38.   var
  39.     i: PtrUInt absolute anIntObj;
  40.   begin
  41.     Inc(i);
  42.     Exit(anIntObj);
  43.   end;
  44.  
  45. begin
  46.   Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  47.   aList := TStringList.Create;
  48.   aList.Duplicates := dupError;
  49.   aList.Sorted := True;
  50.   AssignFile(textf, anInFile);
  51.   try
  52.     Reset(textf);
  53.     while not EOF(textf) do
  54.       begin
  55.         ReadLn(textf, s);
  56.         s := Trim(s);
  57.         idx := aList.IndexOf(s);
  58.         case idx of
  59.           -1: aList.AddObject(s, TObject(one));
  60.           else
  61.             aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
  62.         end;
  63.       end;
  64.   finally
  65.     CloseFile(textf);
  66.   end;
  67. end;
  68.  
  69. procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
  70. var
  71.   i: Integer;
  72. begin
  73.   if Assigned(aList) and Assigned(aMemo) then
  74.     begin
  75.       aMemo.Clear;
  76.       for i := 0 to aList.Count-1 do
  77.         aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
  78.     end;
  79. end;
  80.  
  81. procedure TForm1.FormCreate(Sender: TObject);
  82. var
  83.   sl: TStringList;
  84. begin
  85.   SortCount('infile.txt', sl);
  86.   try
  87.     ShowListInMemo(sl, Memo1);
  88.     Memo1.Lines.SaveToFile('outfile.txt');
  89.   finally
  90.     sl.Free;
  91.   end;
  92. end;
  93.  
  94. end.

howardpc  question to you ;)

I can't develop your code for the next need.

I want to count duplicates (or multi x3,x4...) from a CSV file according to DATETIME.

As a result (memo1) I want: DateTIME; Count; all Device_ID participating in the duplicate

Format of CSV file :


user_id,device_id,"datetime"
18817,13174,2020-01-09 00:01:14
15190,10604,2020-01-09 00:09:04
15190,10604,2020-01-09 00:09:05
10892,7559,2020-01-04 10:02:21
10892,7559,2020-01-04 10:52:59
10892,7559,2020-01-04 13:56:42
10892,7559,2020-01-04 20:46:01
15190,10604,2020-01-09 00:13:48
15190,10604,2020-01-09 00:13:48
6521,4879,2020-01-09 00:14:53

howardpc

  • Hero Member
  • *****
  • Posts: 3519
Re: Sorting and Counting
« Reply #175 on: July 01, 2020, 11:03:18 am »
Try the attached project.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #176 on: July 02, 2020, 07:01:29 am »
Try the attached project.

Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link

https://github.com/credo-science/Windows-Tools/blob/master/dup.rar

howardpc

  • Hero Member
  • *****
  • Posts: 3519
Re: Sorting and Counting
« Reply #177 on: July 02, 2020, 10:15:13 am »
Works with your CSV file. With my (larger file) it doesn't finish.
I'm not in the least bit surprised it fails on your "larger" file. Your file is 3.4 MB!
The code I offered was written in about an hour, and completely untested except on a single file of size 377 bytes. Your post never indicated that you were trying to analyse data files of several MB.

You can't expect code to scale to encompass data a million times bigger than it has been tested on without needing adjustment.
I don't know in the code I offered if the limitation you encounter is to do with memory or a TStringlist's natural limits (e.g. the Count property is an Integer, not an int64). Nor does it really matter. TStringList is not suited for processing such large scale data. You would be better advised to use a proper database, one designed to cope with with large datasets, rather than relying on a roll-your-own database based on TStringList.

The code I offered assumes it can load all data into memory at one go. For large datasets you need some way to cache the data, and process it in manageable chunks. You can't try to swallow it whole all at once.
In other words your data requires a different algorithm, one designed with the scale of the data in mind, not one simply designed on the basis of the data format.
« Last Edit: July 02, 2020, 10:32:18 am by howardpc »

jamie

  • Hero Member
  • *****
  • Posts: 3522
Re: Sorting and Counting
« Reply #178 on: July 02, 2020, 01:24:56 pm »
of your last example is still using the TMEMO then maybe its not able to hold the list. It is of course not intended for large list.
The only true wisdom is knowing you know nothing

jamie

  • Hero Member
  • *****
  • Posts: 3522
Re: Sorting and Counting
« Reply #179 on: July 02, 2020, 01:27:14 pm »
Try the attached project.

Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link

https://github.com/credo-science/Windows-Tools/blob/master/dup.rar

Please provide a "ZIP" file, I don't process RAR files ..

The only true wisdom is knowing you know nothing

 

TinyPortal © 2005-2018