The algorithm computes what statisticians call the "common language effect size" (CLES). The core of it is simple: given two lists of numbers or "scores", you compare all pairs across the lists; if the score from list1 is higher than the score from list2 you add 1 to the numerator of the effect-size fraction, and if the two scores are equal you add 0.5. Finally you divide the numerator by the total number of comparisons (length(list1) * length(list2)).
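For anyone who wants to see it concretely, here is a minimal Python sketch of that naive pairwise algorithm (my own illustration of the description above, not the GitHub code):

```python
def cles_naive(list1, list2):
    """Brute-force common language effect size: compare every pair
    (s1, s2); count 1 for s1 > s2 and 0.5 for a tie, then divide
    by the total number of comparisons."""
    numerator = 0.0
    for s1 in list1:
        for s2 in list2:
            if s1 > s2:
                numerator += 1.0
            elif s1 == s2:
                numerator += 0.5
    return numerator / (len(list1) * len(list2))
```

Because it touches every pair, the work grows as len(list1) * len(list2), which is why it gets slow with thousands of values.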
Two versions of this algorithm are available on GitHub in Python. The one above is included but is described as "naive"; however, not knowing Python, I don't understand how the other, more efficient one works. With just a small number of values in the lists it's not too bad, but with thousands it will be very slow. Any suggestions? Thanks.
Although it has been discussed in some (or many) of the previous posts, I failed to grasp the importance of repeated or equal values in the arrays. The algorithm has to test for equality between elements of one array and the other, which suggests (at least to me) that repeated values within each array are likely. The example given in the initial post contains duplicated values in both arrays. If that is the case, this feature can be used to speed up the algorithm, provided of course that the arrays are sorted. I have developed an algorithm based on these ideas:
- Do not do the calculations inside the loop. Rather, count how many s1 > s2 ("GreaterThan") and how many s1 = s2 ("EqualTo") events there are, and do the math at the end.
- Start the loop at the high end of the arrays and work all the way down to 1 (for results comparable to the brute-force approach) or 0. Working this way makes the program more understandable, at least to me.
- Consider the sorted arrays as a succession of chunks or segments, each holding one or more equal values.
- For each segment in set1, find the segment in set2 which has the same value. If it exists, the number of EqualTo events equals the length of that set2 segment. Then find the highest segment in set2 whose value is less than the value of the set1 segment. The number of GreaterThan events is given by the highest position of that set2 segment.
- Once GreaterThan and EqualTo are calculated for the segment in set1, multiply them by the length of this segment, since all its items have the same value and therefore produce the same events. This can save a lot of time: if, for instance, the average segment is 10 items long, the total time is reduced to roughly one tenth.
- Move down to the previous (lower-index, since I work top to bottom) set1 segment and repeat the process until the beginning of set1 is reached.
- And now do the calculation. Watch for overflows, which can happen with more than 50000 items in the arrays.
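My CalcCles2 procedure is in the attached program, but since the thread started from Python code, here is a rough Python sketch of the same segment idea (my own reconstruction, not the attached code; it walks the segments bottom-up rather than top-down, which gives the same counts, and uses itertools.groupby to collapse each run of equal values):

```python
from itertools import groupby

def cles_segments(set1, set2):
    """Segment-based CLES. Both inputs must be sorted ascending.
    Each run of equal values is compared only once, then its
    GreaterThan/EqualTo counts are weighted by the run length."""
    # Run-length encode: one (value, count) pair per segment.
    runs1 = [(v, sum(1 for _ in g)) for v, g in groupby(set1)]
    runs2 = [(v, sum(1 for _ in g)) for v, g in groupby(set2)]

    greater = 0   # weighted count of s1 > s2 pairs
    equal = 0     # weighted count of s1 == s2 pairs
    below = 0     # items of set2 strictly below the current set1 value
    j = 0
    for v1, n1 in runs1:
        # Advance past every set2 segment whose value is below v1.
        while j < len(runs2) and runs2[j][0] < v1:
            below += runs2[j][1]
            j += 1
        greater += n1 * below
        # A set2 segment with the same value contributes ties.
        if j < len(runs2) and runs2[j][0] == v1:
            equal += n1 * runs2[j][1]
    return (greater + 0.5 * equal) / (len(set1) * len(set2))
```

Because each array is swept once over its segments, the loop does work proportional to the number of distinct values rather than to the product of the array lengths. Python's integers do not overflow, but in a language with fixed-width integers the products here are where the overflow warning above applies.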
I have programmed this algorithm in a fully commented procedure (named CalcCles2) and have used it to build a little test program (inspired by one of the examples posted), which I will try to attach here. I am getting run times of 3.5 ticks versus 3800 ticks for the brute-force approach on 20000-item arrays, and the two calculations agree to at least 9 decimal places.
And yes, I think this time I got it right.
Anyway, it would be interesting to know if my assumptions about repeated values are correct.
Cheers.