Possible approach - ordered associative array with unique keys. FGL is standard with TFPGMap in it.
GMap from fcl-stl - https://github.com/graemeg/freepascal/blob/master/packages/fcl-stl/doc/main.pdf
Why do I think this is homework?
In any case I would use a TStringList, load it, sort it using the features of the TStringList and maybe even use the object of each as a counter.
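A minimal console sketch of that approach — a sorted TStringList with Objects[] repurposed as counters. The sample values are illustrative, not from the actual input file:

```pascal
program CountWithList;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils;

const
  // illustrative input; in practice these lines would come from the file
  Data: array[0..4] of string = ('7', '3', '7', '7', '3');

var
  sl: TStringList;
  i, idx: Integer;

begin
  sl := TStringList.Create;
  try
    sl.Sorted := True;   // keeps IndexOf at O(log n) per lookup
    for i := 0 to High(Data) do
    begin
      idx := sl.IndexOf(Data[i]);
      if idx < 0 then
        sl.AddObject(Data[i], TObject(PtrUInt(1)))                // first sighting
      else
        sl.Objects[idx] := TObject(PtrUInt(sl.Objects[idx]) + 1); // bump counter
    end;
    for i := 0 to sl.Count - 1 do
      WriteLn(sl[i], ' - ', PtrUInt(sl.Objects[i]));              // sorted output
  finally
    sl.Free;
  end;
end.
```

Casting a PtrUInt counter into the Objects[] slot avoids allocating a counter object per unique value, which is exactly the trick howardpc's full solution later in the thread uses.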
gmap might depend on the number of unique values. If that gets large it might also slow down.

About ~10000000 random longint values.
Unfortunately, it's not only a matter of resources: on a dataset of the size specified above, your solution is 4 times (90 s) slower than engkin's one (22 s), and it is prohibitively slow for a task of this size.

I have changed my mind.
A generic container like TStringList is not well suited to this task on large files.
The TXT file is large, sorting takes more than an hour.
@Akira1364, it seems your InputGenerator creates more than 99% of unique values.
Thank you gentlemen. At the moment I'm testing the Engkin algorithm. The TXT file is large, sorting takes more than an hour. Tomorrow I will check your other suggestions. I do not really care about speed, because it's not for the program user, only for me. It is important that the result is correct.

Can you share the input file?
In a sense, it's homework, but it's for me. I'm not a programmer by education, just a hobbyist.

Check also this - https://en.wikipedia.org/wiki/External_sorting.
I slightly changed Akira1364's program:
Yeah, LGenerics is really good...

Nice to hear kind words from you. :)
Did you use Akira's InputGenerator program to generate the input file?
I imagine he used the original "20MB" code of his from an earlier comment for that one.

I will use that. Thank you Akira.
What do you get if you change the ...
Also FWIW, I'm quite sure that for Mangakissa's version, it's this part ...

It also seems that the array search has quadratic complexity.
@440bx
Excellent! Thank you Avk.
For current version Akira's InputGenerator:
Can this be done with {$mode objfpc}?

Obviously, yes! See Howard's comments above.
@julkas, done

I appreciate that, thank you. :)
@440bx:
So, do you offer to compete with the Windows file mapping and ntdll? Okay, why not. :)
To run your SortCount (from SortCount2) inside Howard's benchmark, I put the SortCount2 code in a separate unit:
The differences are not large, but they are consistent. When executed many times over, your implementation performs better. One possible reason that comes to mind is that qsort has to call the compare function "many times" and, I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.) I thought that not having to do string to integer conversions would compensate for the overhead of passing parameters to the sort function; apparently that is not enough.
repeatMillionsCount = 10
  Avk2's time:  3.0730  #unique: 799999  #total: 10000000
  440bx's time: 3.1670  #unique: 799999  #total: 10000000
repeatMillionsCount = 12
  Avk2's time:  3.6350  #unique: 799999  #total: 12000000
  440bx's time: 3.7600  #unique: 799999  #total: 12000000
repeatMillionsCount = 14
  Avk2's time:  4.1970  #unique: 800000  #total: 14000000
  440bx's time: 4.3360  #unique: 800000  #total: 14000000
repeatMillionsCount = 16
  Avk2's time:  4.7580  #unique: 800000  #total: 16000000
  440bx's time: 4.9300  #unique: 800000  #total: 16000000
repeatMillionsCount = 18
  Avk2's time:  5.3040  #unique: 800000  #total: 18000000
  440bx's time: 5.5220  #unique: 800000  #total: 18000000
repeatMillionsCount = 20
  Avk2's time:  5.9130  #unique: 800000  #total: 20000000
  440bx's time: 6.0840  #unique: 800000  #total: 20000000
A typical comparison looks like this (I'm on Linux here, so have sadly omitted 440bx)

One thing is for sure, I cannot compete on portability <chuckle>
I'm guessing, TGOrdinalArrayHelper doesn't have the overhead of calling a compare function (I haven't looked at how it is implemented, therefore I really don't know if that may be the reason.)
As a (happy!) user of LGenerics, I can tell you: it doesn't. As it's TGOrdinalArrayHelper, it uses a built-in sort specifically designed for, well, ordinals (or at least, anything that can be directly compared to them without need of an explicit comparison function, which could possibly also be achieved via operator overloading in the case of more complex types.)

Thank you for clearing that up. As you pointed out, the presence of Ordinal in the helper name is a good hint that the helper is designed and optimized for ordinal types. I suspected it but, without looking at the actual code, I didn't want to draw any conclusions based on just the name.
For anything that doesn't fit that description it just won't compile if specialized with them (failing explicitly on the lines that attempt to do the comparisons, obviously.)

That makes perfect sense.
Just to see what might be done without using generics, I tried replacing the sort in Avk2 with a HeapSort and then a QuickSort (in howardpc's OccurenceCounter). The HeapSort version was about twice as slow as howardpc's, while the QuickSort version edged out that code. I did not directly compare it to Avk2, but that code smokes.
Very interesting!
Fascinating thread.

Yes, it is.
TGOrdinalArrayHelper itself does also have a QuickSort, BTW. Also IntroSort and "DualPivotQuickSort."
I would like to get in on this adventure myself :D

Go on. You are welcome. ;)
However, you guys wouldn't like my version because it would involve using Assembler code. >:D
Benchmark is compiled with a 32-bit compiler and runs on 64-bit Windows 7.

Thank you.
@ASerge, I'm not sure about rtl-generics, but LGenerics is definitely incompatible with FPC 3.0.4.

OK. What about project with FPC 3.3.1?
OK. What about project with FPC 3.3.1?

And what's wrong with the project for 3.3.1?
@440bx
But only for 64-bit. In 32-bit, you win :). The cost of pushing and popping parameters is just too high.
64-bit version, it seems you won :):
And what's wrong with the project for 3.3.1?

Attach it, please.
Also generics.collections is way faster and more modern ...

Is it true?
Usually, yes, but mind the remarks I made in the other thread: it is about some internals. In this thread somebody used those low-level adjustments. See if you can spot who did... :-X

In this case I don't want low-level tricks, fast I/O, ... I want a short, clean and as-fast-as-possible solution with well-known data structures (classes) and algorithms. So I try to understand the pros and cons of the different Pascal generics implementations.
The uses clause is usually a dead giveaway.... Analyze the inheritance...
I want short, clean and fast as possible solution with well known data structures (classes) and algorithms.

You can legitimately say that about the algorithms you used to solve the problem; I _cannot_ say that about the implementations I presented, I traded cleanliness for speed.
In real life I can't even use fcl-stl. 80% of my Pascal code (data structures, algorithms) is written from scratch.
At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest.

+1
Replacing the Format() in my original implementation with a simple Writeln() gives a slight speed increase.

At least theoretically, it seems an optimized version (still clean) of Howard's customized radix sort should usually be the fastest. Its downside is that, in some cases, it can take a lot more memory than desirable.
That's why the results seem strange to me, because Howard's algorithm is O(n), and the algorithms with sorts are O(n*lg(n)). But it is impossible to check, @avk did not provide the project.

They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.
They do look a bit strange until you carefully analyze the algorithm's implementation and the data it has to handle.

But for large n it's better. And the algorithm can be improved. Since a lot of memory is still allocated and we know the data format, it is better to set the minimum and maximum as constants.
Yes, provided that the range is reasonably close to n. As the ratio of range/n increases, a radix sort suffers.
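To make the range/n trade-off concrete, here is a hedged counting sketch (the bounds and sample values are illustrative): the bucket array is sized by the value range, not by the number of inputs, so a wide range costs memory even when n is small.

```pascal
program CountByRange;
{$mode objfpc}

const
  MinVal = 0;
  MaxVal = 999;  // illustrative bounds; memory = (MaxVal - MinVal + 1) counters

var
  Counts: array[MinVal..MaxVal] of Integer;          // zero-initialized (global)
  Data: array[0..5] of Integer = (5, 2, 5, 999, 2, 5);
  i, v: Integer;

begin
  // one pass to count: O(n)
  for i := 0 to High(Data) do
    Inc(Counts[Data[i]]);
  // one pass over the RANGE to emit: distinct values come out sorted for free
  for v := MinVal to MaxVal do
    if Counts[v] > 0 then
      WriteLn(v, ' - ', Counts[v]);
end.
```

With longint values spanning the full 32-bit range, the same idea would need billions of counters, which is exactly why the ratio of range to n matters.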
And the algorithm can be improved. Since memory is still allocated a lot and we know the data format, it is better to set the minimum and maximum as constants.

True, the concern is, once the algorithm uses knowledge about the data format it didn't determine itself, the
I replaced fcl-stl TVector with generics.collections TList in my algo. I don't know why TList gives very poor performance.

Just curious how much performance has degraded?
... I believe Avk2 uses an Introsort.

All LGArrayHelpers sorting algorithms are hybrid, in particular TGOrdinalArrayHelper.Sort tries to use
... But it is impossible to check, @avk did not provide the project.

Let me guess: the project you keep mentioning is LGenerics? If so, see attachment.
...So my fcl-stl TVector solution is better than Akira's generics.collections TDictionary...

This is highly dependent on the input data.
A curious fact: all coincide, except SortCount440bx.

That's because of the compare function. The number of instances of each number is correct, but the collating sequence is different from what is obtained with a numerical comparison.
Hello @avk, @hnb! (See - https://forum.lazarus.freepascal.org/index.php/topic,46254.0.html)
At least on my machine, the performance of your code is second only to 440bx one.

My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)
ETA:
Bruno's implementation can be made a smidgen faster by changing the compare function to test for equality first (since there are more duplicate values than unique), that would lower the number of comparisons required to determine relative magnitudes.
Result := 1 - integer( a=b ) - 2*integer( a<b );
I am surprised that is faster because, in order to calculate the result, the arithmetic expression, unlike a Boolean expression, must be fully evaluated, which means that in all cases two (2) compares, instead of possibly just one, will be necessary.

Result := 1 - integer( a=b ) - 2*integer( a<b );
At least on my system (Core i7 6700K) the latter runs at roughly triple the speed of the former.
MathMan
My entry should be disqualified because the sort sequence it generates is not the sort sequence expected by the "user" (in this case the OP.)

And this cannot be fixed?
And this can not be fixed?

It sure can, and Bruno and yourself are using the "fix" I'd have to use, which is doing the string to integer conversion without having readln do it for you (which is slow).
...Both of your algorithms can be made even faster by not using writeln. I suppose there must be an object (I'm guessing, a TMemoryStream) that would allow writing an entire block of memory (properly formatted beforehand) in one shot instead of a gazillion writeln(s)...

IMO what is already there is already too much, all these things move the code farther and farther from correctness, simplicity and portability. But curious.
...But why not

Result := 1 - integer( a=b ) - 2*integer( a<b );
...
IMO what is already there is already too much, all this things move the code farther and farther from correctness, simplicity and portability. But curious.

I completely agree with that. I admit to being curious too and there are a number of optimizations that come to mind, but it really feels they are completely out of place for what should be (and can be) a very simple program.
...But why not Result := 1 - integer( a=b ) - 2*integer( a<b ); ?
Result := Integer(a > b) - Integer(a < b);
Why not add logarithms to increase the effect of obfuscation. ;D

Result := 1 - integer( a=b ) - 2*integer( a<b );
In earnest: If only the sign of the result of the compare function is evaluated by the sort, wouldn't it be sufficient to just subtract the values?

You've just shown a bit of code that, once seen, seems totally obvious and makes one wonder why that isn't the way everyone does it.
function ComparePairs(constref L, R: TIntPair): LongInt;
begin
  Result := L.Key - R.Key;
end;
Makes me wonder if there is a reason, other than simply not thinking about it, why it isn't normally done that way. I cannot think of one.

I think it is normally done that way.
I think it is normally done that way.

I've read a lot of code in various languages and I think it's the first time I see it done that way, because I'd remember it. Now that I've seen it, I'm not about to forget it.
I've seen that very code (or something almost identical) both in this forum (I believe it was in code from Marco) and in the FPC sources.
But yes, if only the sign is required then simple subtraction should be sufficient.

For a sort compare function only the sign should matter (provided the sort function doesn't compare against hard-coded values, -1, 0, 1, which it definitely shouldn't.)
But what happens if, for example, L.Key = 1500000000 and R.Key = -1500000005?

Yes, you are right. Those values would cause an overflow which would incorrectly indicate that L is less than R.
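The overflow, and the sign-based fix shown earlier in the thread, can be demonstrated with a small sketch (program and function names are mine, the values are the ones from the question above):

```pascal
program CompareOverflow;
{$mode objfpc}{$Q-}  // overflow checks off, so the bad compare wraps silently

// subtraction-based compare: wrong whenever a - b does not fit in a LongInt
function CompareSub(a, b: LongInt): LongInt;
begin
  Result := a - b;
end;

// sign-based compare: always -1, 0 or 1, no overflow possible
function CompareSafe(a, b: LongInt): LongInt;
begin
  Result := Ord(a > b) - Ord(a < b);
end;

begin
  // 1500000000 - (-1500000005) = 3000000005, which wraps to a negative
  // LongInt, so CompareSub wrongly reports "L less than R"
  WriteLn(CompareSub(1500000000, -1500000005));   // negative: wrong sign
  WriteLn(CompareSafe(1500000000, -1500000005));  // 1: correct
  WriteLn(CompareSafe(3, 3));                     // 0: equal
end.
```

The subtraction trick is only safe when the key range is known to be narrow enough that the difference always fits, e.g. keys already constrained to 0..MaxInt div 2.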
Ugg, looks too much like BASIC.....

BASIC rocks.
>:(
but it works :)

Just out of curiosity, it would be interesting to see how the performance of that Python implementation compares with the various Pascal implementations.
Python's sort implementation is excellent. It's based on the Timsort algorithm.
You only have to wish, sir. :)

Thank you very much Avk. :)
Can you compare only the sorting phase time? (Python's default I/O is slow.)

Sort phase only:
Python: Time elapsed: 1.1443881560007867, #unique: 999955, #total: 10000000
Pascal: Time elapsed: 0.1750, #unique: 999955, #total: 10000000
@avk Thanks.

Julkas, ideone lags behind in compiler version for FPC and that can make a big difference. You should run the complete test suite on ideone to get any meaningful comparison.
https://ideone.com/9n168K
We can't compare Python (an interpreted language) with Pascal.

Yes, we can, since Python relies so heavily on library code compiled in other languages. Python is just glue. Much more so than pure scripting.
PYTHON rocks.
(You can also use FPC to write and add Python libraries)
And indeed:
I use Python and I like Python... :P :o
I think TStringList is eminently suitable for this task.
Here's an alternative solution, which may use fewer resources.
unit mainSortCount;

{$mode objfpc}{$H+}

interface

uses
  Classes, SysUtils, Forms, StdCtrls;

type
  TForm1 = class(TForm)
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
  end;

var
  Form1: TForm1;

procedure SortCount(const anInFile: String; out aList: TStringList);
procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);

implementation

{$R *.lfm}

{ TForm1 }

procedure SortCount(const anInFile: String; out aList: TStringList);
const
  one = PtrUInt(1);
var
  textf: TextFile;
  s: String;
  idx: Integer;

  function GetSuccObj(anIntObj: TObject): TObject;
  var
    i: PtrUInt absolute anIntObj;
  begin
    Inc(i);
    Exit(anIntObj);
  end;

begin
  Assert(FileExists(anInFile), 'cannot find file "'+anInFile+'"');
  aList := TStringList.Create;
  aList.Duplicates := dupError;
  aList.Sorted := True;
  AssignFile(textf, anInFile);
  try
    Reset(textf);
    while not EOF(textf) do
      begin
        ReadLn(textf, s);
        s := Trim(s);
        idx := aList.IndexOf(s);
        case idx of
          -1: aList.AddObject(s, TObject(one));
        else
          aList.Objects[idx] := GetSuccObj(aList.Objects[idx]);
        end;
      end;
  finally
    CloseFile(textf);
  end;
end;

procedure ShowListInMemo(constref aList: TStringList; aMemo: TMemo);
var
  i: Integer;
begin
  if Assigned(aList) and Assigned(aMemo) then
    begin
      aMemo.Clear;
      for i := 0 to aList.Count-1 do
        aMemo.Lines.Add('%s - %d', [aList[i], PtrUInt(aList.Objects[i])]);
    end;
end;

procedure TForm1.FormCreate(Sender: TObject);
var
  sl: TStringList;
begin
  SortCount('infile.txt', sl);
  try
    ShowListInMemo(sl, Memo1);
    Memo1.Lines.SaveToFile('outfile.txt');
  finally
    sl.Free;
  end;
end;

end.
After sorting, I don't know how to refer to the other variables in the record, but I don't want to ask for more :)

Question for you: do you want your program to run on multiple platforms (e.g., Linux, Windows, other) or is Windows-only acceptable?
... I have a problem installing the LG package ....
ETA: if Windows-only is acceptable, then another question: are the records in the file fixed length or variable length?
What kind of install? Just make sure it is in your path.

I will try as you write. I installed the Lgenerics.LPK file, there was an error and I gave up.
Just to make sure I understand the structure of your file and its records. It looks like every record in the file consists of seven (7) fields (one per line) and that, sometimes, a field may be blank. I want to confirm that every record is always 7 fields (though one or more, except the unixtime, may be blank), is this correct ?
1570485826087 - UnixTime
0,0 - lat
0,0 - lon
Patrycja Macała - user name
16584 - user number
ekonomik1 - team name
1581 - team number
(empty line as separator)
1570485826087
0,0
0,0
Gargastwór
17943
kosmos
1243
<many more>
1567365497795
0,0
0,0
J.C.K.
6037
1
1567367488139
0,0
0,0
J.C.K.
6037
1
I am interested in obtaining the following information:

Mostly with the help of the other contributors in this thread, since my implementation doesn't even use the appropriate collating sequence.
- how many detections are in the same second / minute / hour. (I already got it thanks to your help)
- If there are several in the same second / minute / hour, display which users and their other data, such as number, coordinates, team name etc.

Piece of pie. I got several things to do today so I won't commit to a solid timeframe, but I'll give you something soon.
Thanks .
I don't want you to think that I'm using you; what I do is on my own, for my own curiosity.

Don't worry about that.
Detections in smartphones are the impact of cosmic ray particles on the phone's camera (photons, electrons, muons).

Experiments like that have the potential to yield surprising results. Many scientific discoveries resulted from unexpected side effects. One that comes to mind is the microwave oven: the scientists were constantly having headaches, and that's how they figured out that the microwaves were frying their brains (not kidding... though saying "frying" is a bit of an exaggeration.)
<snip>
Scientists are also looking for links between detection frequencies and other phenomena,
@mpknap, did you mean something like that?

That's exactly how I would see it. I have to test. I want to transfer this data to Google Maps.
In fact, everything is not as good as you say.

But you proved again that errare humanum est. ;)
...In order to use it (FPC 3.3.1 and higher and Lazarus 1.9.0 and higher)...

If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.
If installing the appropriate version of the compiler is a serious problem for you, I might think about how to do without LGenerics.

If you could, it would be nice. This will be open code, and I don't want to oblige other interested parties to install packages. It's supposed to be the easiest, not necessarily fast.
Ok, basically we only need sorting and binary search algorithms.

Just in case you may be interested, and in addition to what Thaddy mentioned above: under Windows, ntdll provides a typical bsearch function to search a sorted sequence.
A fairly good sorting algorithm is available in fcl-stl. But I had to write BinarySearch from scratch,
I hope that it will work correctly. Let me know if anything goes wrong.
AVK, I'm sorry but I still have a problem with starting sort_count.lpk. It refers to LGenerics and can't find RTTI. I even installed the latest version of Lazarus 2.0.6 with FPC 3.0.4 for Windows 10/64.

LGenerics needs 3.2.0 or trunk, not 3.0.4. avk already wrote that.
Thank you anyway.
@mpknap, please test this version.

Back-porting just before a major release? :D :D :D :D
However, there is a significant point. In our case, we need a binary search that, in the case of duplicate values, returns the position of the leftmost one.

Yes, that is definitely an important difference in this case. In such a case, when using bsearch, the resulting index when a match is found has to be "manually adjusted" to ensure it is the index of the first instance match.
And thus, the O(log N) algorithm (theoretically) turns into the O(N) algorithm?

It normally wouldn't be O(N); it would be O(lg N) + (avg_dups_per_key/2).
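For reference, the leftmost position can also be found directly, without scanning back from an arbitrary match: a lower-bound style binary search stays O(log N) regardless of how many duplicates there are. A minimal sketch (names are mine, not from avk's code):

```pascal
program LeftmostBSearch;
{$mode objfpc}

// Returns the index of the LEFTMOST occurrence of aValue in the sorted
// array A, or -1 if it is absent. Always O(log N), duplicates or not.
function BisectLeft(const A: array of LongInt; aValue: LongInt): SizeInt;
var
  Lo, Hi, Mid: SizeInt;
begin
  Lo := 0;
  Hi := Length(A);               // half-open search interval [Lo, Hi)
  while Lo < Hi do
  begin
    Mid := Lo + (Hi - Lo) div 2;
    if A[Mid] < aValue then
      Lo := Mid + 1              // leftmost match lies to the right of Mid
    else
      Hi := Mid;                 // Mid could still be the leftmost match
  end;
  if (Lo < Length(A)) and (A[Lo] = aValue) then
    Result := Lo
  else
    Result := -1;
end;

begin
  //                  index: 0  1  2  3  4  5
  WriteLn(BisectLeft([1, 3, 3, 3, 7, 9], 3)); // 1: first of the three 3s
  WriteLn(BisectLeft([1, 3, 3, 3, 7, 9], 5)); // -1: not present
end.
```

The trick is that on equality the search keeps going left (Hi := Mid) instead of returning, so the loop converges on the first occurrence.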
...In the worst case, for 1 element duplicated N times, then it would be O(N).

Yes it is, and for N/2, N/4, ... duplicated elements.
Yes, but it depends on how the duplicates are distributed. For instance, consider N elements where all the duplicates are clustered in, just for example, 4 elements. Accessing any of those 4 elements is basically O(n), while accessing any other element is O(lg N). For such a distribution (granted, it is an unusual one), the big O of the totality is neither O(n) nor O(lg N); for this example it would be (2n + (n - 4) * lg n).
@mpknap, please test this version.

Yes Yes Yes!!! It works! And that was what I meant, this kind of sorting and displaying information.
Try the attached project.
Works with your CSV file. With my (larger) file it doesn't finish.

I'm not in the least bit surprised it fails on your "larger" file. Your file is 3.4 MB!
Works with your CSV file. With my (larger file) it doesn't finish. See the DUP.ZIP project in the link
https://github.com/credo-science/Windows-Tools/blob/master/dup.rar
Works with your CSV file. With my (larger) file it doesn't finish. See the DUP.ZIP project in the link

Are you sure it doesn't finish????
[..] I tend to go for a brute force approach, when a few more minutes reflection would save me effort in the long run.
@rvk, the loop runs one iteration more than it should 8-)

In my example or in the original in the .rar?
If I put in a writeln(I) it does progress, but with two lines a second. With 104.000 lines it will take over 14 hours to complete.
I don't think this code is the best way to find duplicates in large files.
That's why it lasts so long: 90,000 x 1 second ;)

Look a few posts back. I showed you code which does it in less than 5 seconds.
If you really only want to count duplicate dates,
uhm... 1 sec * 90.000 ?
For giggles I fired up sqlite3:

.mode csv
.import pewniacy_odstycznia_true.csv dupdup
.schema dupdup
CREATE TABLE dupdup(
  "user_id" TEXT,
  "device_id" TEXT,
  "datetime" TEXT
);
SELECT *, COUNT(datetime) AS dupes
FROM dupdup
GROUP BY datetime
HAVING dupes > 1;

Which produces an instant result.
It simply indicates you're using the wrong tool for the job. (And by tool, I mean classes/solution/approach, not Pascal as a language.)
In SQLite (DB Browser for SQLite) I tried to do this. The problem is that it is not possible to display in the DEVICEID output column all the devices participating in the multi event.

If you are seriously thinking of using the data as an SQL dataset in your program, then I can try to set it up here at my end.
I need their numbers the most.
Yes, SQLite is faster than all ;)

You can do it that fast in Pascal too. Like TRon already said, it was the approach that was wrong.
You can do it that fast in pascal too. Like TRon already said. It's was the approach that was wrong.

In howardpc's and your defence (probably others as well, I haven't read the whole thread), initially you people had to work with the sample data (which in hindsight wasn't a good representation of the actual situation)
That's why it's important to first think of what you want, set it on paper, have a design, and then (and not sooner) go programming.

Exactly.
The only problem is that I do not know exactly what you mean by "it is not possible to display in the DEVICEID output column all the Devices participating in the multi event".

TRon.
The point is that nobody knows what we're looking for and how to find it.

Just a very general comment: when you believe there might be something to be found in the data but don't even know what, that's when SQL databases are great. (Not the only thing they are excellent for, but that's one of them.)
The point is that nobody knows what we're looking for and how to find it. These are just my guesses.

I have no idea what your program is supposed to be doing as a final result or how this should be presented to the user; since you are the programmer, you are the one in control. So yes, if you do not know what you wish to achieve/accomplish in the end, then we do not know either ;)
This can be compared to "looking for rain drops that will fall on three flowers in a large garden at the same time. If this event occurs again within a few minutes (at least once), there is a suspicion of success."

The answer to that question is 42 btw.
If the value in the DUP column is, for example, "2", we still do not know which Device_IDs make it up. There would have to be an additional column which shows the device numbers... See screen in the attachment

Yeah, and that is impossible to realise because the dupcount is/can be made up of multiple Device_IDs.
procedure TForm1.FormCreate(Sender: TObject);
begin
  FOriginalCSV := TStringList.Create;
  try
    FOriginalCSV.LoadFromFile('pewniacy_odstycznia_true.csv');
    CollectDuplicateDates(FOriginalCSV, DuplicatesMemo);
  finally
    FOriginalCSV.Free;
  end;
end;

procedure TForm1.CollectDuplicateDates(aCSVList: TStrings; aMemo: TMemo);
var
  i, dups: integer;
  dups_string: String;
  deviceids: TStringList;
  A: array of string;
begin
  deviceids := TStringList.Create;
  try
    deviceids.Duplicates := dupAccept;
    deviceids.Sorted := True;
    // make a sorted list with DATE+TIME as first entry
    for i := 0 to Pred(aCSVList.Count) do
    begin
      A := aCSVList[i].Split(',');
      if (Length(A) > 2) then
        deviceids.Add(A[2] + ',' + A[0] + ',' + A[1]);
    end;
    dups := 0;
    dups_string := '';
    for i := 1 to Pred(deviceids.Count) do
    begin
      // ONLY check date+time entry, first 19 characters
      if copy(deviceids[i - 1], 1, 19) = copy(deviceids[i], 1, 19) then
      begin
        Inc(dups);
        if dups = 1 then
          dups_string := deviceids[i - 1];
        dups_string := dups_string + ' // ' + deviceids[i];
      end
      else
      begin
        A := deviceids[i].Split(',');
        if dups > 0 then
          aMemo.Lines.Add('"%s" count=%d device ids: "%s"',
            [A[0], dups + 1, dups_string]);
        dups := 0;
        dups_string := '';
      end;
    end;
  finally
    deviceids.Free;
  end;
end;
I need one more condition in your code.

Is your user-id always the same as device-id on the same date+time?
I want only those records where the DeviceIDs are not the same to be displayed in the Memo.
If there are 3 dups for one DateTime and the DeviceIDs are all 3 the same, then we reject it.
I tried to do it myself but it didn't work out.
This will bring me closer to finding "raindrops";)
Is your user-id always the same as device-id on the same date+time?
In that case my initial thought was correct and you can just match the entire line during sorting. Set duplicates to dupIgnore and same user,device,datetimes are ignored.
So this should be sufficient
deviceids.Duplicates := dupIgnore;
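A tiny sketch of the dupIgnore behaviour being relied on here (the sample lines are made up): with Sorted set to True, adding an exact duplicate is silently dropped, while a line differing only in device id is kept.

```pascal
program DupIgnoreDemo;
{$mode objfpc}{$H+}

uses
  Classes;

var
  sl: TStringList;

begin
  sl := TStringList.Create;
  try
    sl.Sorted := True;
    sl.Duplicates := dupIgnore;                // duplicates silently dropped
    sl.Add('2019-10-07 22:03:46,dev1,user1');
    sl.Add('2019-10-07 22:03:46,dev1,user1'); // identical line: ignored
    sl.Add('2019-10-07 22:03:46,dev2,user2'); // different device: kept
    WriteLn(sl.Count);                        // 2
  finally
    sl.Free;
  end;
end.
```

Note that dupIgnore only takes effect while Sorted is True; on an unsorted list, Add appends unconditionally.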
It works :)

So does:

SELECT rowid, datetime, device_id, user_id,
       COUNT(datetime) AS dupes,
       GROUP_CONCAT(DISTINCT device_id || ' (' || user_id || ')') AS dup_ids
FROM DATA
GROUP BY datetime
HAVING dupes > 1 AND instr(dup_ids, ',') > 0;

It still doesn't mean that picture of yours is reproducible or representative for your data ... ;)
You are a genius!!!!
You don't even know how much I was looking for this! A revelation :)

I'm pleased to learn that it is useful for you.
You can make the condition that only those DUP_IDS are displayed where the number of "," (commas) is greater than 3.

Yes, that is possible to realise, just not very reliable (it involves deleting characters from the original string and comparing the length of both strings in order to determine how many commas there are).
In the JPG attachment with explanation ;)

Just for the record: in the dataset you shared with us, there is no such data. In that selection, there doesn't seem to exist any data that matches the criteria with having
Something with having count(DISTINCT device_id) > 2 or likewise???
Something with having count(DISTINCT device_id) > 2 or likewise???

Interesting.
...the second one (and I agree that it is tempting to want to try) unfortunately seems to include more results than the original statement we started out with. I haven't been able to determine which data exactly it concerns (so I am unable to tell why).

Oh, wait... it seems I made an error in my verification SQL statement there. :-[ Sorry about that.
SELECT datetime(TIMESTAMP/1000,'unixepoch') AS czas,
       COUNT(datetime(TIMESTAMP/1000,'unixepoch')) AS dupes,
       GROUP_CONCAT(DISTINCT device_id) AS dup_ids
FROM detections
GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
HAVING dupes > 2 AND COUNT(DISTINCT device_id) > 2
I'm not sure you even need the dupes column then????
SELECT datetime(TIMESTAMP/1000,'unixepoch') AS czas,
       GROUP_CONCAT(DISTINCT device_id) AS dup_ids
FROM detections
GROUP BY datetime(TIMESTAMP/1000,'unixepoch')
HAVING COUNT(DISTINCT device_id) > 2
Other than that I am also unable to test it further as the provided dataset does not contain any data matching the criteria.
@mpknap: as stated before: Change the objective and you can start redesigning your statement(s) ;)
Have you considered creating your own custom function(s) using Pascal? See also: http://www.sqlite.org/c3ref/create_function.html as unfortunately SQLite does not seem to support the statement "create function".
Interesting, but possible in Pascal?

Of course. I use that all the time for my special needs... Mind the cdecl for your external libraries, though.
Yes, I know, but unfortunately the original SQLite file is 4.5 GB.

Yikes. That's more than the 104.000 lines you gave before. That illustrates the point even more. You should have mentioned that at the beginning. A simple one-TStringList solution with sorting in memory isn't really feasible in that case and we would have suggested a DB solution from the beginning.
Interesting, but possible in Pascal?

As Thaddy already wrote, yes
Yes, I know, but unfortunately the original SQLITE file is 4.5GB.

Ah, the final requirements/conditions. It only took 210 posts ;)