The algorithm computes what statisticians call the "common language effect size" (CLES). The core of it is simple: given two lists of numbers or "scores", you compare all pairs across the lists; if the score from list1 is higher than the score from list2 you add 1 to the numerator of the effect-size fraction, and if the two scores are equal you add 0.5. Finally you divide the numerator by the total number of comparisons (length(list1) * length(list2)).
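For anyone who wants to see it concretely, here is a minimal Python sketch of that naive pairwise algorithm (my own illustration of the description above, not the GitHub code):

```python
def cles_naive(list1, list2):
    """Brute-force common language effect size: compare every pair
    (s1, s2); count 1 for s1 > s2 and 0.5 for a tie, then divide
    by the total number of comparisons."""
    numerator = 0.0
    for s1 in list1:
        for s2 in list2:
            if s1 > s2:
                numerator += 1.0
            elif s1 == s2:
                numerator += 0.5
    return numerator / (len(list1) * len(list2))
```

Because it touches every pair, the work grows as len(list1) * len(list2), which is why it gets slow with thousands of values.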
Two versions of this algorithm are available on GitHub in Python. The one above is included but is described as "naive"; however, not knowing Python, I don't understand how the other, more efficient one works. With just a small number of values in the lists it's not too bad, but with thousands it will be very slow. Any suggestions? Thanks.
Although it has been discussed in some (or many) of the previous posts, I failed to grasp the importance of repeated or equal values in the arrays. The algorithm has to test for equality between elements of one array and the other, which suggests (at least to me) that repeated values within each array are likely. The example given in the initial post contains duplicated values in both arrays. If that is the case, this feature can be used to speed up the algorithm, provided of course that the arrays are sorted. I have developed an algorithm based on these ideas:
- Do not do the calculations inside the loop. Rather, count how many s1 > s2 ("GreaterThan") and how many s1 = s2 ("EqualTo") events there are, and do the math at the end.
- Start the loop at the high end of the arrays and work all the way down to 1 (for results comparable to the brute-force approach) or 0. Working this way makes the program more understandable, at least to me.
- Consider the sorted arrays as a succession of chunks or segments, each holding one or more equal values.
- For each segment in set1, find the segment in set2 which has the same value. If it exists, the number of EqualTo events equals the length of that set2 segment. Then find the highest segment in set2 whose value is less than the value of the set1 segment. The number of GreaterThan events is given by the highest position of that set2 segment.
- Once GreaterThan and EqualTo are calculated for the segment in set1, multiply them by the length of this segment, since all its items have the same value and therefore produce the same events. This can save a lot of time: if, for instance, the average segment is 10 items long, the total time is reduced to roughly one tenth.
- Move down to the previous (lower-index, since I work top to bottom) set1 segment and repeat the process until the beginning of set1 is reached.
- And now do the calculation. Watch for overflows, which can happen with more than 50000 items in the arrays.
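My CalcCles2 procedure is in the attached program, but since the thread started from Python code, here is a rough Python sketch of the same segment idea (my own reconstruction, not the attached code; it walks the segments bottom-up rather than top-down, which gives the same counts, and uses itertools.groupby to collapse each run of equal values):

```python
from itertools import groupby

def cles_segments(set1, set2):
    """Segment-based CLES. Both inputs must be sorted ascending.
    Each run of equal values is compared only once, then its
    GreaterThan/EqualTo counts are weighted by the run length."""
    # Run-length encode: one (value, count) pair per segment.
    runs1 = [(v, sum(1 for _ in g)) for v, g in groupby(set1)]
    runs2 = [(v, sum(1 for _ in g)) for v, g in groupby(set2)]

    greater = 0   # weighted count of s1 > s2 pairs
    equal = 0     # weighted count of s1 == s2 pairs
    below = 0     # items of set2 strictly below the current set1 value
    j = 0
    for v1, n1 in runs1:
        # Advance past every set2 segment whose value is below v1.
        while j < len(runs2) and runs2[j][0] < v1:
            below += runs2[j][1]
            j += 1
        greater += n1 * below
        # A set2 segment with the same value contributes ties.
        if j < len(runs2) and runs2[j][0] == v1:
            equal += n1 * runs2[j][1]
    return (greater + 0.5 * equal) / (len(set1) * len(set2))
```

Because each array is swept once over its segments, the loop does work proportional to the number of distinct values rather than to the product of the array lengths. Python's integers do not overflow, but in a language with fixed-width integers the products here are where the overflow warning above applies.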
I have programmed this algorithm in a fully commented procedure (named CalcCles2) and have used it to build a little test program (inspired by one of the examples posted), which I will try to attach here. I am getting run times of 3.5 ticks versus 3800 ticks for the brute-force approach on 20000-item arrays, and the two calculations agree to at least 9 decimal places.
And yes, I think this time I got it right.
Anyway, it would be interesting to know if my assumptions about repeated values are correct.
Cheers.