Author Topic: Percentiles: Off by 0.5  (Read 3944 times)

Boleeman

  • Hero Member
  • *****
  • Posts: 1158
Percentiles: Off by 0.5
« on: October 21, 2023, 12:13:31 pm »
I converted some C# code that computes percentiles, but the 25th percentile (q1 := Percentile(x, 25);) and the 75th percentile (q3 := Percentile(x, 75);) seem to be off by 0.5.

Could it have something to do with using doubles? I'm not sure.

Please refer to the attached png for the dodgy result.


Code: Pascal
program percentiles;

{$MODE OBJFPC}{$H+}

uses Math, SysUtils;

function Percentile(sortedData: array of Double; p: Double): Double;
var
  position, leftNumber, rightNumber, n, part: Double;
begin
  if p >= 100.0 then
    Exit(sortedData[Length(sortedData) - 1]);

  position := (Length(sortedData) + 1) * p / 100.0;
  leftNumber := 0.0;
  rightNumber := 0.0;
  n := p / 100.0 * (Length(sortedData) - 1) + 1.0;

  if position >= 1 then
  begin
    leftNumber := sortedData[Trunc(n) - 1];
    rightNumber := sortedData[Trunc(n)];
  end
  else
  begin
    leftNumber := sortedData[0];
    rightNumber := sortedData[1];
  end;

  if leftNumber = rightNumber then
    Exit(leftNumber);

  part := n - Trunc(n);
  Result := leftNumber + part * (rightNumber - leftNumber);
end;

procedure Main;
var
  x: array[0..19] of Double;
  q1, q2, q3: Double;
begin
  x[0] := 10;
  x[1] := 12;
  x[2] := 14;
  x[3] := 16;
  x[4] := 17;
  x[5] := 19;
  x[6] := 20;
  x[7] := 20;
  x[8] := 21;
  x[9] := 22;
  x[10] := 24;
  x[11] := 27;
  x[12] := 29;
  x[13] := 36;
  x[14] := 38;
  x[15] := 40;
  x[16] := 41;
  x[17] := 43;
  x[18] := 50;
  x[19] := 52;

  q1 := Percentile(x, 25);
  q2 := Percentile(x, 50);
  q3 := Percentile(x, 75);

  writeln;
  writeln('list of percentiles');
  writeln;
  writeln('1st quartile:', FormatFloat('0.00', q1));
  writeln('2nd quartile:', FormatFloat('0.00', q2));
  writeln('3rd quartile:', FormatFloat('0.00', q3));
  writeln('press enter to continue');
  readln;

end;

begin
  Main;
end.

« Last Edit: October 21, 2023, 12:38:02 pm by Boleeman »

Laksen

  • Hero Member
  • *****
  • Posts: 802
    • J-Software
Re: Percentiles: Off by 0.5
« Reply #1 on: October 21, 2023, 12:18:25 pm »
You call the parameter sortedData, but you never sort it.

Boleeman

  • Hero Member
  • *****
  • Posts: 1158
Re: Percentiles: Off by 0.5
« Reply #2 on: October 21, 2023, 12:29:31 pm »
The numbers in array x are already sorted manually in ascending order (just to keep it simple).
« Last Edit: October 21, 2023, 12:33:25 pm by Boleeman »

rvk

  • Hero Member
  • *****
  • Posts: 7045
Re: Percentiles: Off by 0.5
« Reply #3 on: October 21, 2023, 12:46:27 pm »
In your image you do (17+19)/2
Why are you not doing that in code?

I'm not sure what Percentile should calculate, but I imagine your n is wrong.

You can trace through it (debug), check the values at each point, and see where the code is wrong (or at least differs from your manual calculation).


Boleeman

  • Hero Member
  • *****
  • Posts: 1158
Re: Percentiles: Off by 0.5
« Reply #4 on: October 21, 2023, 12:58:18 pm »
Ah, you are right, rvk.

Result := (rightNumber + leftNumber) / 2;

gives the correct result,

but why does Result := leftNumber + part * (rightNumber - leftNumber); fail?

rvk

  • Hero Member
  • *****
  • Posts: 7045
Re: Percentiles: Off by 0.5
« Reply #5 on: October 21, 2023, 01:23:28 pm »
"but why does Result := leftNumber + part * (rightNumber - leftNumber); fail?"
Because that's a completely different function.
And part is probably not what you expected.
(It's 0.75, not the 0.5 you expected.)

And 17 + 0.75 * (19 - 17) = 18.5

paweld

  • Hero Member
  • *****
  • Posts: 1644
Re: Percentiles: Off by 0.5
« Reply #6 on: October 21, 2023, 01:29:58 pm »
Because part is (p / 100.0 * (Length(sortedData) - 1) + 1.0) - Trunc(p / 100.0 * (Length(sortedData) - 1) + 1.0).

In your sample:
p = 25
Length(sortedData) = 20

So:
Code:
(25 / 100 * (20 - 1) + 1) - trunc(25 / 100 * (20 - 1) + 1) =
(0.25 * 19 + 1) - trunc(0.25 * 19 + 1) =
(4.75 + 1) - trunc(4.75 + 1) =
5.75 - trunc(5.75) =
5.75 - 5 = 0.75
Best regards / Pozdrawiam
paweld
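The arithmetic above can also be checked by printing the intermediate values, as suggested earlier in the thread. This is only a minimal sketch: FractionalRank is a hypothetical helper name (not in the original code) that reproduces the expression assigned to n in Percentile, and the inputs 20 and 25 are the thread's sample values.

```pascal
program tracepart;

{$MODE OBJFPC}{$H+}

// Hypothetical helper: the 1-based fractional rank that the thread's
// Percentile function stores in n.
function FractionalRank(count: Integer; p: Double): Double;
begin
  Result := p / 100.0 * (count - 1) + 1.0;
end;

var
  n, part: Double;
begin
  n := FractionalRank(20, 25);   // 20 values, 25th percentile
  part := n - Trunc(n);          // weight applied to the right neighbour

  writeln('n           = ', n:0:2);
  writeln('left index  = ', Trunc(n) - 1);
  writeln('right index = ', Trunc(n));
  writeln('part        = ', part:0:2);
end.
```

For the sample input this should report n = 5.75, indices 4 and 5, and part = 0.75, which is why the result lands at 18.5 rather than at the midpoint 18.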

Boleeman

  • Hero Member
  • *****
  • Posts: 1158
Re: Percentiles: Off by 0.5
« Reply #7 on: October 21, 2023, 01:35:13 pm »
Ah now I see.

So Result := leftNumber + 0.5 * (rightNumber - leftNumber);

as well as

Result := (rightNumber + leftNumber) / 2;

would also work.

Was getting a bit tired. Thanks all for your help and explanations.

Code: Pascal
program percentiles;

{$MODE OBJFPC}{$H+}

uses Math, SysUtils;

function Percentile(sortedData: array of Double; p: Double): Double;
var
  position, leftNumber, rightNumber, n, part: Double;
begin
  if p >= 100.0 then
    Exit(sortedData[Length(sortedData) - 1]);

  position := (Length(sortedData) + 1) * p / 100.0;
  leftNumber := 0.0;
  rightNumber := 0.0;
  n := p / 100.0 * (Length(sortedData) - 1) + 1.0;

  if position >= 1 then
  begin
    leftNumber := sortedData[Trunc(n) - 1];
    rightNumber := sortedData[Trunc(n)];
  end
  else
  begin
    leftNumber := sortedData[0];
    rightNumber := sortedData[1];
  end;

  if leftNumber = rightNumber then
    Exit(leftNumber);

  //part := n - Trunc(n);
  Result := leftNumber + 0.5 * (rightNumber - leftNumber);
  //Result := (rightNumber + leftNumber) / 2;
end;

procedure Main;
var
  x: array[0..19] of Double;
  q1, q2, q3: Double;
begin
  x[0] := 10;
  x[1] := 12;
  x[2] := 14;
  x[3] := 16;
  x[4] := 17;
  x[5] := 19;
  x[6] := 20;
  x[7] := 20;
  x[8] := 21;
  x[9] := 22;
  x[10] := 26;
  x[11] := 27;
  x[12] := 29;
  x[13] := 36;
  x[14] := 38;
  x[15] := 40;
  x[16] := 41;
  x[17] := 43;
  x[18] := 50;
  x[19] := 52;

  q1 := Percentile(x, 25);
  q2 := Percentile(x, 50);
  q3 := Percentile(x, 75);

  writeln;
  writeln('list of percentiles');
  writeln;
  writeln('1st quartile:', FormatFloat('0.00', q1));
  writeln('2nd quartile:', FormatFloat('0.00', q2));
  writeln('3rd quartile:', FormatFloat('0.00', q3));
  writeln('press enter to continue');
  readln;

end;

begin
  Main;
end.
« Last Edit: October 21, 2023, 01:40:36 pm by Boleeman »

rvk

  • Hero Member
  • *****
  • Posts: 7045
Re: Percentiles: Off by 0.5
« Reply #8 on: October 21, 2023, 02:32:44 pm »
I don't know what you want to do, but if it's calculating the percentile...
I thought the 25th percentile of 1, 2, 3, 7, 8, 9 is 2.
It falls on a single number. That's not the case in your function.

The 50th percentile does fall between 3 and 7, and for that you do need to take the midpoint.

https://www.mathsisfun.com/data/percentiles.html

Or am I missing a point?

wp

  • Hero Member
  • *****
  • Posts: 13578
Re: Percentiles: Off by 0.5
« Reply #9 on: October 21, 2023, 03:38:27 pm »
If you are expecting values out of the data array, you probably want to look at the "nearest-rank method" on Wikipedia (https://en.wikipedia.org/wiki/Percentile).
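As a rough illustration of the nearest-rank idea, here is a minimal sketch. It assumes the usual definition (ordinal rank = smallest integer not below p/100 * N); the function name PercentileNearestRank and the clamping of p = 0 to the first element are my own choices, not something from the thread.

```pascal
program nearestrank;

{$MODE OBJFPC}{$H+}

uses Math;

// Nearest-rank percentile: always returns a value that occurs in the data.
// Assumes sortedData is non-empty and sorted ascending, and 0 <= p <= 100.
function PercentileNearestRank(const sortedData: array of Double; p: Double): Double;
var
  rank: Integer;
begin
  // 1-based ordinal rank: smallest integer >= p/100 * N
  rank := Ceil(p / 100.0 * Length(sortedData));
  // Clamp so that p = 0 maps to the first element
  rank := EnsureRange(rank, 1, Length(sortedData));
  Result := sortedData[rank - 1];
end;

const
  sample: array[0..5] of Double = (1, 2, 3, 7, 8, 9);
begin
  // 25% of 6 values gives 1.5, rounded up to rank 2, i.e. the value 2
  writeln(PercentileNearestRank(sample, 25):0:2);
end.
```

With this definition the result is always one of the data values, so no interpolation is needed.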

But there are many other definitions...

The one which still is easy to understand is as follows: Draw the 20 indices on a sheet of paper, maybe in 1cm steps, and label them by the index number on one side of the axis and by the data point value on the other side - see the attached screenshot taken from TAChart. The quartiles divide the line between the first (0) and last index (19) in quarters by length. The 1st quartile (25%) is at "index" 19/4 = 4.75, the 2nd quartile, or median, (50%) at "index" 19/2 = 9.5, and the 3rd quartile at "index" 19*3/4 = 14.25. The quartiles are the data values at these indices. But indices are integers. How to get a data value at a fractional index? By interpolation...

Let's begin with the median (50%) because it is easier to understand: The data value before the division point at index 9.5 is the one at index 9 with value 22, the one after the division point is at index 10 with value 24. The division is right in the center (9.5), therefore we take the simple average of the two data values: (22+24)*0.5 = 23 (or in other words, both values are equally weighted by the factor 0.5: 22*0.5 + 24*0.5) -- this is the median.

The 25%-point is at "index" 4.75. The plot shows that it would be unfair if both neighbors were weighted equally in the average. Since 4.75 is closer to 5 (the first index after the division) the overall value should be closer to x[5] than to x[4]. Precisely: we weight x[5] by the factor 0.75 (the fractional part of 4.75) and x[4] by 1-0.75 = 0.25: x[4]*0.25 + x[5]*0.75 = 17*0.25 + 19*0.75 = 18.5 -- this is the 1st quartile at 25%.

The calculation is the same for the 75th percentile. It is at "index" 14.25, i.e. we need the data values at indices 14 and 15, and we weight them by the factors 0.75 and 0.25, respectively: 38*0.75 + 40*0.25 = 38.5
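The index-and-weights description above could be written roughly as follows. This is a sketch under stated assumptions: the name PercentileLinear is hypothetical, the input is assumed to be non-empty and already sorted, and the data is the 20-value array from the first post.

```pascal
program interpquartiles;

{$MODE OBJFPC}{$H+}

// Percentile by linear interpolation at the fractional 0-based index
// p/100 * (N - 1). Assumes sortedData is non-empty and sorted ascending.
function PercentileLinear(const sortedData: array of Double; p: Double): Double;
var
  idx, weight: Double;
  left: Integer;
begin
  idx := p / 100.0 * (Length(sortedData) - 1);
  left := Trunc(idx);                       // index before the division point
  if left >= High(sortedData) then
    Exit(sortedData[High(sortedData)]);     // p = 100: no right neighbour
  weight := idx - left;                     // fractional part of the index
  // The right neighbour gets the fractional part, the left one the rest
  Result := sortedData[left] * (1.0 - weight) + sortedData[left + 1] * weight;
end;

const
  x: array[0..19] of Double = (10, 12, 14, 16, 17, 19, 20, 20, 21, 22,
                               24, 27, 29, 36, 38, 40, 41, 43, 50, 52);
begin
  writeln('Q1: ', PercentileLinear(x, 25):0:2);
  writeln('Q2: ', PercentileLinear(x, 50):0:2);
  writeln('Q3: ', PercentileLinear(x, 75):0:2);
end.
```

For this data it should reproduce the hand-calculated values 18.5, 23 and 38.5.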

You can estimate the same values from the plot in the screenshot by visual inspection.

In my opinion, however, these differences are rather irrelevant. Press and Flannery wrote somewhere in their famous book "Numerical Recipes" about the ominous formula for the standard deviation (where nobody understands why it divides by N-1 rather than by N): when it matters whether 1 is subtracted from the number of data values, you have a problem anyway, because you need a large number of samples to do good statistics, and it makes no big difference whether you have 1000 values or 999.
« Last Edit: October 21, 2023, 03:57:45 pm by wp »

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Percentiles: Off by 0.5
« Reply #10 on: October 21, 2023, 04:43:05 pm »
Considering Boleeman's other question about box plots: the "percentile" can be obtained using different methods for determining a quartile. See also Five-number summary.
« Last Edit: October 21, 2023, 04:45:42 pm by TRon »
Today is tomorrow's yesterday.

rvk

  • Hero Member
  • *****
  • Posts: 7045
Re: Percentiles: Off by 0.5
« Reply #11 on: October 21, 2023, 06:34:47 pm »
Yes, and according to that example the Q1 (25th percentile) could land on a single number.

In the ordered data set 7, 15, 36, 39, 40, 41 it is always 15 (except for method 4, where it's 13).

But the given function will not return that, so it is wrong (back to the drawing board :D ).

(If Boleeman has any more questions, it should first be made clear which method is desired.)
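To see where the 13 comes from, here is a small sketch. It assumes that "method 4" means linear interpolation at the 1-based rank p/100 * (N + 1); that reading, the function name PercentileNPlusOne, and the clamping at both ends are my assumptions, so check them against the page being discussed.

```pascal
program rankplusone;

{$MODE OBJFPC}{$H+}

// Interpolates at the 1-based rank p/100 * (N + 1).
// Assumes sortedData is non-empty and sorted ascending.
function PercentileNPlusOne(const sortedData: array of Double; p: Double): Double;
var
  rank, weight: Double;
  k: Integer;
begin
  rank := p / 100.0 * (Length(sortedData) + 1);
  if rank <= 1.0 then
    Exit(sortedData[0]);                    // below the first rank
  if rank >= Length(sortedData) then
    Exit(sortedData[High(sortedData)]);     // at or beyond the last rank
  k := Trunc(rank);                         // 1-based rank of the left neighbour
  weight := rank - k;
  Result := sortedData[k - 1] + weight * (sortedData[k] - sortedData[k - 1]);
end;

const
  sample: array[0..5] of Double = (7, 15, 36, 39, 40, 41);
begin
  // rank = 0.25 * 7 = 1.75, so the result is 7 + 0.75 * (15 - 7)
  writeln(PercentileNPlusOne(sample, 25):0:2);
end.
```

The function from the first post uses (N - 1) * p / 100 + 1 as its rank instead, so for this data set it returns neither 15 nor 13.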

Boleeman

  • Hero Member
  • *****
  • Posts: 1158
Re: Percentiles: Off by 0.5
« Reply #12 on: October 21, 2023, 10:09:18 pm »
I had to go to bed as I was still feeling a bit off.

WP, I did not know about the "nearest-rank method" on Wikipedia or the other methods. Thanks for the demo; it helped me understand it.

Thanks to all (Laksen, RVK, TRon) for looking into quartiles.


Was wanting to make a boxplot, so I looked at percentiles (this thread) and quartiles (other thread).
I thought there was one standard answer but looks like there are variations.
« Last Edit: October 21, 2023, 11:23:10 pm by Boleeman »

TRon

  • Hero Member
  • *****
  • Posts: 4377
Re: Percentiles: Off by 0.5
« Reply #13 on: October 23, 2023, 08:24:28 am »
"I thought there was one standard answer but looks like there are variations."
The wiki is just the tip of the iceberg in that regard.

So far I've encountered six (!) different methods for calculating quartiles, and found an online source that even mentions fifteen. Without intending to offend anyone in particular, but sheesh, those statistics people are a strange bunch. For me, yet another reason to never trust analyzed data ::)

edit: better search terminology just delivered me a paper from 2009 mentioning 15 different methods. Go figure (literally :D )

PS: since you seem to be porting your existing software from VB to Lazarus, would it not be easier to just show us how you calculated the quartiles in VB so that we can help translate?

In case you do not want to do that (which is OK), you could at least feed your VB program the numbers from the Wikipedia page on quantiles I linked to and show us the resulting quantiles. That would make it possible to come up with a quantile function for Free Pascal that returns the same results or, if they do not match any of the methods on the Wikipedia page, to try to figure out which method your VB program used.
« Last Edit: October 23, 2023, 11:01:30 am by TRon »
Today is tomorrow's yesterday.
