
Author Topic: Sorting and Counting  (Read 15703 times)

rvk

  • Hero Member
  • *****
  • Posts: 4281
Re: Sorting and Counting
« Reply #180 on: July 02, 2020, 01:33:29 pm »
Quote from: mpknap
Works with your CSV file. With my (larger) file it doesn't finish. See the DUP.ZIP project in the link.

Are you sure it doesn't finish?

If I put in a writeln(i) it does progress, but only at two lines a second. With 104,000 lines it would take roughly 14 hours to complete.
I don't think this code is the best way to find duplicates in large files.

rvk

  • Hero Member
  • *****
  • Posts: 4281
Re: Sorting and Counting
« Reply #181 on: July 02, 2020, 01:51:53 pm »
If you really only want to count duplicate dates, you might want to add the string to the sorted TStringList with the date first. After that you can loop over the list and only compare the first 19 characters of each string. Don't work with CommaText etc.; it's far too slow.

In that case you get something like this.
Note the String.Split(',') calls, which are much, much faster than CommaText.
There is no need for a separate unique-date string list, because the list is already sorted while loading. And because the date is now the first field, the list is sorted on date (for easy checking). You do need to set Duplicates to dupAccept and Sorted to True.

Code: Pascal

procedure TForm1.FormCreate(Sender: TObject);
begin
  FOriginalCSV := TStringList.Create;
  try
    FOriginalCSV.LoadFromFile('pewniacy_odstycznia_true.csv');
    CollectDuplicateDates(FOriginalCSV, DuplicatesMemo);
  finally
    FOriginalCSV.Free;
  end;
end;

procedure TForm1.CollectDuplicateDates(aCSVList: TStrings; aMemo: TMemo);
var
  i, dups: integer;
  dups_string: String;
  deviceids: TStringList;
  A: array of string;
begin
  deviceids := TStringList.Create;
  try
    deviceids.Duplicates := dupAccept;
    deviceids.Sorted := True;

    // make a sorted list with DATE+TIME as the first field
    for i := 0 to Pred(aCSVList.Count) do
    begin
      A := aCSVList[i].Split(',');
      if (Length(A) > 2) then deviceids.Add(A[2] + ',' + A[0] + ',' + A[1]);
    end;

    dups := 0;
    dups_string := '';
    for i := 1 to Pred(deviceids.Count) do
    begin
      // ONLY compare the date+time field: the first 19 characters
      if Copy(deviceids[i - 1], 1, 19) = Copy(deviceids[i], 1, 19) then
      begin
        Inc(dups);
        if dups = 1 then dups_string := deviceids[i - 1];
        dups_string := dups_string + ' // ' + deviceids[i];
      end
      else
      begin
        // take the date from the last line of the run, not from the next group
        A := deviceids[i - 1].Split(',');
        if dups > 0 then
          aMemo.Lines.Add('"%s"  count=%d   device ids: "%s"', [A[0], dups + 1, dups_string]);
        dups := 0;
        dups_string := '';
      end;
    end;

    // flush a duplicate run that ends on the very last line
    if dups > 0 then
    begin
      A := deviceids[Pred(deviceids.Count)].Split(',');
      aMemo.Lines.Add('"%s"  count=%d   device ids: "%s"', [A[0], dups + 1, dups_string]);
    end;

  finally
    deviceids.Free;
  end;
end;

« Last Edit: July 02, 2020, 02:21:39 pm by rvk »

howardpc

  • Hero Member
  • *****
  • Posts: 3517
Re: Sorting and Counting
« Reply #182 on: July 02, 2020, 04:56:26 pm »
With rvk's more intelligent algorithm it looks like you can still use TStringList to good effect on multi-MB files within a reasonable time frame.
Time spent on developing a good design before writing a line of code is time well spent, and can lead to cleaner and faster-performing code.
I usually find it easier to adapt and improve other people's code than to come up with a winner myself first time. Perhaps this is because I'm just an autodidact hobbyist with no formal training in IT, and I tend to go for a brute-force approach when a few more minutes of reflection would save me effort in the long run.

lucamar

  • Hero Member
  • *****
  • Posts: 3019
Re: Sorting and Counting
« Reply #183 on: July 02, 2020, 05:26:58 pm »
Quote from: howardpc
[..] I tend to go for a brute-force approach when a few more minutes of reflection would save me effort in the long run.

You've got plenty of company in that, and not only among "autodidact hobbyist"s. Professionals in a hurry tend to go for that too (and I speak from personal experience :-[)
Turbo Pascal 3 CP/M - Amstrad PCW 8256 (512 KB !!!) :P
Lazarus 2.0.8/FPC 3.0.4 - 32/64 bits on:
(K|L|X)Ubuntu 12..18, Windows XP, 7, 10 and various DOSes.

jamie

  • Hero Member
  • *****
  • Posts: 3520
Re: Sorting and Counting
« Reply #184 on: July 02, 2020, 06:33:36 pm »
@rvk, the loop is one more than it should be  8-)

The only true wisdom is knowing you know nothing

rvk

  • Hero Member
  • *****
  • Posts: 4281
Re: Sorting and Counting
« Reply #185 on: July 02, 2020, 06:45:08 pm »
Quote from: jamie
@rvk, the loop is one more than it should be  8-)

In my example or in the original in the .rar?

Of course this could be done even more efficiently, but I was just reacting to the given code in the .rar (where there is a loop within a loop). A TStringList was used there, so I built on that.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #186 on: July 02, 2020, 09:01:49 pm »
Quote from: rvk
Quote from: mpknap
Works with your CSV file. With my (larger) file it doesn't finish. See the DUP.ZIP project in the link.

Are you sure it doesn't finish?

If I put in a writeln(i) it does progress, but only at two lines a second. With 104,000 lines it would take roughly 14 hours to complete.
I don't think this code is the best way to find duplicates in large files.

I used Writeln(i);
The program works, but it's slow.
 
Code: Pascal

writeln(aUniqueDates.Count);

shows over 90,000.

The loop:

Code: Pascal

for i := 0 to Pred(aUniqueDates.Count) do
......

takes over one second for each step.

That's why it takes so long: 90,000 × 1 second is about 25 hours ;)

rvk

  • Hero Member
  • *****
  • Posts: 4281
Re: Sorting and Counting
« Reply #187 on: July 02, 2020, 09:04:38 pm »
Quote from: mpknap
That's why it takes so long. 90,000 x 1 second ;)

Look a few posts back. I showed you code which does it in less than 5 seconds.

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #188 on: July 02, 2020, 09:10:20 pm »
Quote from: rvk
If you really only want to count duplicate dates,

Yes. I'm only interested in pairs, triplets and more...

More precisely, it's about finding triplets and then looking for a second (similar) triplet for the same DEVICE_ID in close proximity, e.g. within 5 minutes.
It's complicated to describe because there are no exact directions. I'm running blind research ;)
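The 5-minute proximity itself is easy to test with the DateUtils unit; a minimal sketch only (the 'yyyy-mm-dd hh:nn:ss' format and the sample timestamps are assumptions based on the 19-character comparisons earlier in the thread):

```pascal
program WithinWindow;
{$mode objfpc}{$H+}
uses
  SysUtils, DateUtils;

function ParseStamp(const S: String): TDateTime;
begin
  // assumes 'yyyy-mm-dd hh:nn:ss', as suggested by the 19-char keys
  Result := ScanDateTime('yyyy-mm-dd hh:nn:ss', S);
end;

var
  T1, T2: TDateTime;
begin
  T1 := ParseStamp('2020-01-15 12:00:10');
  T2 := ParseStamp('2020-01-15 12:04:55');
  // MinutesBetween truncates, so < 5 means "within 5 minutes"
  if MinutesBetween(T1, T2) < 5 then
    WriteLn('within 5 minutes of each other');
end.
```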

TRon

  • Hero Member
  • *****
  • Posts: 536
Re: Sorting and Counting
« Reply #189 on: July 02, 2020, 09:33:26 pm »
uhm... 1 sec * 90.000 ?

For giggles I fired up sqlite3:
Code: SQL  [Select][+][-]
  1. .mode csv
  2. .import pewniacy_odstycznia_true.csv dupdup
  3. .schema dupdup
  4. CREATE TABLE dupdup(
  5.   "user_id" TEXT,
  6.   "device_id" TEXT,
  7.   "datetime" TEXT
  8. );
  9. SELECT *, COUNT(datetime) AS dupes FROM dupdup GROUP BY datetime HAVING dupes > 1;
  10.  
Which produces an instant result.

It simply indicates you're using the wrong tool for the job (and by tool I mean classes/solution/approach, not Pascal as a language).

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #190 on: July 02, 2020, 09:33:36 pm »
Quote from: rvk
Quote from: mpknap
That's why it takes so long. 90,000 x 1 second ;)

Look a few posts back. I showed you code which does it in less than 5 seconds.

thx. really fast  :)

mpknap

  • Full Member
  • ***
  • Posts: 145
Re: Sorting and Counting
« Reply #191 on: July 02, 2020, 09:47:46 pm »
Quote from: TRon
uhm... 1 sec * 90.000 ?

For giggles I fired up sqlite3:
Code: SQL

.mode csv
.import pewniacy_odstycznia_true.csv dupdup
.schema dupdup
CREATE TABLE dupdup(
  "user_id" TEXT,
  "device_id" TEXT,
  "datetime" TEXT
);
SELECT *, COUNT(datetime) AS dupes FROM dupdup GROUP BY datetime HAVING dupes > 1;

Which produces an instant result.

It simply indicates you're using the wrong tool for the job (and by tool I mean classes/solution/approach, not Pascal as a language).

I tried to do this in SQLite (DB Browser for SQLite). The problem is that it is not possible to display, in the DEVICE_ID output column, all the devices participating in the multi event.
I need their numbers the most.

If you can do it, that would be great :)

Yes, SQLite is faster than all ;)

TRon

  • Hero Member
  • *****
  • Posts: 536
Re: Sorting and Counting
« Reply #192 on: July 02, 2020, 09:53:08 pm »
Quote from: mpknap
I tried to do this in SQLite (DB Browser for SQLite). The problem is that it is not possible to display, in the DEVICE_ID output column, all the devices participating in the multi event.
I need their numbers the most.

If you are seriously thinking of using the data as an SQL dataset in your program, then I can try to set it up here at my end.

The only problem is that I do not know exactly what you mean by "it is not possible to display in the DEVICE_ID output column all the devices participating in the multi event".

The output from the statement in my previous post displays the device_id. Do you mean the event can happen at the exact same date-time but using another device_id? I.e. you need an (additional) distinction between device_ids?

rvk

  • Hero Member
  • *****
  • Posts: 4281
Re: Sorting and Counting
« Reply #193 on: July 02, 2020, 10:09:24 pm »
Quote from: mpknap
Yes, SQLite is faster than all ;)

You can do it that fast in Pascal too. Like TRon already said, it was the approach that was wrong.

In my example I used a TStringList with Sorted = True and added all the lines to get a sorted list. After that I did a second loop to detect the duplicates. Both steps could be pulled together to make it even faster.
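Pulling the steps together would give a single pass with per-key counters; a rough sketch only (the Objects-as-counter trick and ExtractDelimited are substitutions of my own, and it only counts per date+time rather than collecting the device ids):

```pascal
program CountDups;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils, StrUtils;

var
  Lines, Counts: TStringList;
  Key: String;
  i, idx: Integer;
begin
  Lines := TStringList.Create;
  Counts := TStringList.Create;
  try
    Counts.Sorted := True;  // Find then uses a binary search
    Lines.LoadFromFile('pewniacy_odstycznia_true.csv');
    for i := 0 to Pred(Lines.Count) do
    begin
      // third CSV field is the date+time; only its first 19 characters matter
      Key := Copy(ExtractDelimited(3, Lines[i], [',']), 1, 19);
      if Key = '' then Continue;
      if Counts.Find(Key, idx) then
        // abuse Objects[] as a per-key counter (classic TStringList idiom)
        Counts.Objects[idx] := TObject(PtrInt(Counts.Objects[idx]) + 1)
      else
        Counts.AddObject(Key, TObject(PtrInt(1)));
    end;
    for i := 0 to Pred(Counts.Count) do
      if PtrInt(Counts.Objects[i]) > 1 then
        WriteLn(Counts[i], '  count=', PtrInt(Counts.Objects[i]));
  finally
    Counts.Free;
    Lines.Free;
  end;
end.
```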

But I like the SQLite approach too. You can create an SQL statement to get the duplicate devices. You can even expand it to give duplicates within 5 minutes of each other, like you mentioned (although that will become somewhat advanced SQL :) ).
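A rough sketch of that 5-minute idea, assuming the dupdup table from TRon's post and that the datetime column is in an ISO format SQLite's date functions can parse:

```sql
-- pair each row with any later row for the same device
-- whose timestamp is at most 5 minutes after it
SELECT a.device_id, a.datetime AS first_seen, b.datetime AS second_seen
FROM dupdup a
JOIN dupdup b
  ON b.device_id = a.device_id
 AND b.datetime > a.datetime
 AND julianday(b.datetime) - julianday(a.datetime) <= 5.0 / 1440;  -- 5 min as fraction of a day
```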

That's why it's important to first think about what you want, set it down on paper, make a design, and then (and not sooner) start programming.

TRon

  • Hero Member
  • *****
  • Posts: 536
Re: Sorting and Counting
« Reply #194 on: July 02, 2020, 10:58:16 pm »
Quote from: rvk
You can do it that fast in Pascal too. Like TRon already said, it was the approach that was wrong.

In howardpc's and your defence (and probably others' as well; I haven't read the whole thread), you people initially had to work with the sample data, which in hindsight wasn't a good representation of the actual situation.

@mpknap
And of course there are exceptions. If you are going to convert a decades-old database that needs an upgrade, do you prefer speed over consistency? I don't, and will choose the slower, more precise solution over any speedy one that might be sloppy. Time is usually of no concern in such cases (and if it is, then those that impose the time restriction can go play with themselves, as they had decades to think about their sloppy maintenance) (*)

Quote from: rvk
That's why it's important to first think about what you want, set it down on paper, make a design, and then (and not sooner) start programming.
Exactly.

@mpknap
Especially if it concerns a project that is actually a little over your head (for whatever reason). I usually look at the bigger picture first, write down the flow of the program, and then try to split the bigger chunks of the program into parts that I am still able to take on, again on paper. If I need techniques (or topics) that I have never dealt with before, I can check those first in a small test program before incorporating such parts into the final program. It's a balance between keeping the bigger picture in mind while at the same time working on the detailed implementations. The more work you do beforehand (on paper), the easier it becomes to actually implement the code.

(*) Therefore, also in relation to what I wrote directly above, it is important to think about things like speed beforehand. Sometimes it matters, sometimes it won't. More is not always better  :D
« Last Edit: July 02, 2020, 11:00:52 pm by TRon »

 
