Lazarus

Free Pascal => FPC development => Topic started by: Pangea on July 24, 2019, 10:22:49 pm

Title: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 24, 2019, 10:22:49 pm
As suggested i open a forum thread for the problem discussed in https://bugs.freepascal.org/view.php?id=35878 (https://bugs.freepascal.org/view.php?id=35878). If this is the wrong subforum, please move the thread.

Background information: I noticed that the output of the 64-bit random function appears to be biased towards smaller numbers. I originally thought, it would be enough to adapt the same trick the 32 bit implementation uses to the 64 bit function and opened the bug tracker entry. But this was a mistake, as this also seems to be insufficient. It is possible to construct two really simple programs to show that both (the 32 and the 64-bit version) do not produce unbiased output. I apologize for this confusion.

This is the program to reproduce this bug for the 64-bit version ("function Random(l:int64):int64;")
Code: Pascal  [Select]
  1. program random_test;
  2. {$mode objfpc}{$H+}
  3. {$APPTYPE CONSOLE}
  4. const
  5.    l: UInt64 = 6148914691236517205;
  6. var
  7.    s,n: UInt64;
  8.    i,j: UInt64;
  9. begin
  10.   WriteLn('Experiment:', LineEnding);
  11.   WriteLn(' Draw a natural number r from the intervall [0,l-1] and');
  12.   WriteLn(' increment a counter s when r < l div 2 is satisfied.');
  13.   WriteLn(' Repeat this step n times and calculate the ratio s/n.', LineEnding);
  14.  
  15.   WriteLn(' Expected ratio: ', (l div 2)/l:30, LineEnding);
  16.  
  17.   WriteLn('Input size n':16, 'Observed ratio s/n':30);
  18.   l := 6148914691236517205;
  19.   for j := 4 to 63 do
  20.   begin
  21.     n := (Int64(1) shl j);
  22.     s := 0;
  23.     for i := 0 to n-1 do
  24.       if Random(Int64(l)) < l div 2 then
  25.         Inc(s);
  26.     WriteLn( (UInt64(1) shl j):16, s/n:30);
  27.   end;
  28. end.
  29.  

And this is the code for the 32 bit version ("function Random(l:longint):longint;"):

Code: Pascal  [Select]
  1. program random_test;
  2. {$mode objfpc}{$H+}
  3. {$APPTYPE CONSOLE}
  4. var
  5.    s,n: longint;
  6.    i,j: longint;
  7.    l: LongInt;
  8. begin
  9.   l := 1227133513;
  10.  
  11.   WriteLn('Experiment:', LineEnding);
  12.   WriteLn(' Draw a natural number r from the intervall [0,l-1] and');
  13.   WriteLn(' increment a counter s when r mod 2 = 0 is satisfied.');
  14.   WriteLn(' Repeat this step n times and calculate the ratio s/n.', LineEnding);
  15.  
  16.   WriteLn(' Expected ratio: ', (l div 2)/l:30, LineEnding);
  17.  
  18.   WriteLn('Input size n':16, 'Observed ratio s/n':30);
  19.   for j := 4 to 31 do
  20.   begin
  21.     n := (LongInt(1) shl j);
  22.     s := 0;
  23.     for i := 0 to n-1 do
  24.       if Random(LongInt(l)) mod 2 = 0 then
  25.         Inc(s);
  26.     WriteLn( (LongInt(1) shl j):16, s/n:30);
  27.   end;
  28. end.

Both programs have an expectation value of 1/2, but the current random implementation does not converge to this.

My take on this matter: This is not a problem of the underlying random number generator, but the way the output range of the PRNG is mapped onto the smaller interval. Therefore it is completly possible for the output of this functions to be biased, while the PRNG reproduces the values of the reference implementation. The bias is not introduced in the generator, but in the map. I also think it is necessary to adapt rejection sampling to guarantee an unbiased result.

Feel free to tear this apart. : )

Edit: I tested the programs against version 3.0.4.




Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Martin_fr on July 25, 2019, 01:08:50 am
Actually you appear to be right.

From system unit:
Code: Pascal  [Select]
  1. function random(l:longint): longint;
  2. begin
  3.   { otherwise we can return values = l (JM) }
  4.   if (l < 0) then
  5.     inc(l);
  6.   random := longint((int64(cardinal(genrand_MT19937))*l) shr 32);
  7. end;
  8.  
  9. function random(l:int64): int64;
  10. begin
  11.   { always call random, so the random generator cycles (TP-compatible) (JM) }
  12.   random := int64((qword(cardinal(genrand_MT19937)) or ((qword(cardinal(genrand_MT19937)) shl 32))) and $7fffffffffffffff);
  13.   if (l<>0) then
  14.     random := random mod l
  15.   else
  16.     random := 0;
  17. end;
  18.  

genrand_MT19937 returns a random 32 bit value. Lets assume that part to be as correct as can be.


There is a generic problem.
If you do "Random(100)" you need to map the 2^32 values to 0..99. But 2^31 is not int dividable by 100. So some of 0..99 will be more likely.
Once you evenly distributed ‭4,294,967,200 out of the ‭4,294,967,296‬ values, you have 96 values left, to evenly distribute.
How to you map 96 values to 0..99 so that each of 0..99 is equally likely chosen?


This problem then manifests in both of the implementations
64bit:
    random := random mod l
means that  96 to 99 are slightly less likely.
(If it was QWord) With random (2^63 + n), the last 2*n values are only half as likely to occur.

32bit:
The numbers with different likelihood are equally spread across the requested range.

--
The issue is that any stable function to map the result of genrand_MT19937 to a given range will have the same problem.

You would need additional inputs to map them different (e.g counting how often "random" was called.)
But such additional input may be guess-able. And if it is part of the random result, then an attacker may be able to make predictions based on that.

Of course if an attacker knows your range, and that you use the fpc implementation, said attacker can make guesses too.

--
Since using "Mod" the bias on the numbers depends on how big the range is (relative to the max of genrand_MT19937 ), you can get a bigger random number, by calling genrand_MT19937  several times.
It still will be biased, but statistically less likely.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Thaddy on July 25, 2019, 07:12:48 am
It can be right, it can likely be solved by scaling back and forth by the forrmulea -slightly adjusted - I gave and its inverse. That will give a best fit but is slow. The algorithm itself, however is a standard.
Still: we can't change that, because that is how it worked since TP days.  Existing data will become unusable and that is not acceptable.
(I have quite a lot of such data dating as far back as its introduction in FreePascal. There are billions of such data sets, because the algorithm is so widely used in multiple languages.
My suggestion is, and the focus of a solution I will propose on the mailing list, is to overload and where not possible add differently named functionality.
Caveat is a platform without an FPU must support soft float and that is not always the case so that is frankly not a solution. I will investigate if it is possible to implement the first suggestion, but only for ranged randoms: these are not part of the actual standard. The good news is I have the 64 bit MT working and I am now testing if the output is equal to the original test data at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/mt19937-64.out.txt.
If that works I will work on the problem we discuss here so it can be solved for both 32 an 64 bit.
Reason is that I want the 64 bit version standards compliant first, then add it to my compendium and then propose a patch. Non- Xplatform solutions will not be considered by me.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: ASerge on July 25, 2019, 08:01:37 am
For those who want a quick workaround, I can suggest adding this to the beginning of the test:
Code: Pascal  [Select]
  1. function Random(X: SizeInt): SizeInt;
  2. begin
  3.   Result := Round(System.Random * X);
  4. end;
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 25, 2019, 11:15:04 am
For those who want a quick workaround, I can suggest adding this to the beginning of the test:
Code: Pascal  [Select]
  1. function Random(X: SizeInt): SizeInt;
  2. begin
  3.   Result := Round(System.Random * X);
  4. end;

This still fails when SizeInt has 64 bit and L is large. As a test please consider this adapted version of the 64-bit program (i am using the 64-bit compiler, so sizeint has 64-bit):

Code: Pascal  [Select]
  1. program random_test;
  2.  
  3. {$mode objfpc}{$H+}
  4. {$APPTYPE CONSOLE}
  5.  
  6. function RoundRandom(x: SizeInt): SizeInt;
  7. begin
  8.   Result := round(x*Random);
  9. end;
  10.  
  11. var
  12.    s,n: Int64;
  13.    i,j: Int64;
  14.    l: Int64;
  15. begin
  16.   l := 6148914691236517205;
  17.  
  18.   WriteLn('Experiment:', LineEnding);
  19.   WriteLn(' Draw a natural number r from the intervall [0,l-1] and');
  20.   WriteLn(' increment a counter s when r mod 2 = 0 is satisfied.');
  21.   WriteLn(' Repeat this step n times and calculate the ratio s/n.', LineEnding);
  22.  
  23.   WriteLn(' Expected ratio: ', (l div 2)/l:30, LineEnding);
  24.  
  25.   WriteLn('Input size n':16, 'Observed ratio s/n':30);
  26.   for j := 4 to 31 do
  27.   begin
  28.     n := (Int64(1) shl j);
  29.     s := 0;
  30.     for i := 0 to n-1 do
  31.       if RoundRandom(Int64(l)) mod 2 = 0 then
  32.         Inc(s);
  33.     WriteLn( (Int64(1) shl j):16, s/n:30);
  34.   end;
  35. end.    

So here is an variant using rejection sampling (no floating point arithmetic, just reject the samples you cannot map correctly):

Code: Pascal  [Select]
  1.   function RandomRejectionSampling(L: UInt64): Uint64;
  2.   var
  3.     thr, rem: UInt64;
  4.   begin
  5.     if L = 0 then
  6.       Result := 0
  7.     else
  8.     begin
  9.       // We need to calculate High(UInt64)+1 mod L without using 128 bit arithmetic.
  10.       rem := (High(UInt64) mod L) + 1;
  11.       if rem = L then
  12.         rem := 0;
  13.       // Calculate the threshold
  14.       thr := (High(UInt64) - rem);
  15.       if thr < L-1 then
  16.         thr := L-1;
  17.       // Perform rejection sampling
  18.       repeat
  19.         Result := ((UInt64(GenRand32) shl 32) or UInt64(GenRand32));
  20.       until Result <= thr;
  21.       Result := Result mod L;
  22.     end;
  23.   end;  
  24.  

It's just the very basic idea, so there might be room for optimization.

Edit: Fixed a bug in the code.




Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: ASerge on July 25, 2019, 11:25:45 am
This still fails when SizeInt has 64 bit and L is large. As a test please consider this adapted version of the 64-bit program (i am using the 64-bit compiler, so sizeint has 64-bit):
And where fails? I've added my function to both your implementations. Everything was compiled and started without problems. And the distribution became about 0.5.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 25, 2019, 11:41:58 am
It only produces even numbers for large L. You can test it against the 64 bit program when you change the condition to random(l) mod 2 = 0.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Martin_fr on July 25, 2019, 12:27:01 pm
Quote
Code: Pascal  [Select]
  1.      // Perform rejection sampling
  2.       repeat
  3.         Result := ((UInt64(GenRand32) shl 32) or UInt64(GenRand32));
  4.       until Result <= thr;

Depending on the underlying random, you may have to reject a lot of numbers...

Also this makes the random function, take different amount of time depending on where in the underlaying random sequence it is. Are you sure this will not open the door for timing attacks?
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 25, 2019, 01:38:45 pm
I attached a plot showing the acceptance probability over the choice L. The acceptance should never be smaller than 1/2 (unless there is a bug).

My personal opinion on the timing part is: The underlying PRNG is not cryptographically secure in any way, so securing the random number generation against side channel attacks does not really help.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: ASerge on July 25, 2019, 01:47:38 pm
It only produces even numbers for large L. You can test it against the 64 bit program when you change the condition to random(l) mod 2 = 0.
That's expected. The accuracy of fraction part for double is 52, for large numbers only the exponent is used.
There is no such problem for x32, of course.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Thaddy on July 25, 2019, 04:03:21 pm
My personal opinion on the timing part is: The underlying PRNG is not cryptographically secure in any way, so securing the random number generation against side channel attacks does not really help.
It is not secure, but it is compiled in B+ mode, so Boolean shortcuts have no timing influence because every expression is fully evaluated.. This was done on my request.
[edit]
This is not true for random, but for the hashes. But timing attacks can be resolved by B+.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 25, 2019, 06:53:04 pm
I don't think it is necessary to harden random against timing attacks. It is not designed to be used as cryptographic primitive.

I eliminated an unnecessary comparison and this is, what i would propose:
Code: Pascal  [Select]
  1.   function Random(L: UInt32): UInt32; overload;
  2.   var
  3.     thr, rem: UInt32;
  4.   begin
  5.     if L = 0 then
  6.       Result := 0
  7.     else
  8.     begin
  9.       rem := (UInt64(High(UInt32)) + 1) mod L;
  10.       thr := (High(UInt32) - rem);
  11.       repeat
  12.         Result := genrand_MT19937;
  13.       until Result <= thr;
  14.       Result := Result mod L;
  15.     end;
  16.   end;    
  17.  
  18.   function Random(L: UInt64): UInt64; overload;
  19.   var
  20.     thr, rem: UInt64;
  21.   begin
  22.     if L = 0 then
  23.       Result := 0
  24.     else
  25.     begin
  26.       // We need to calculate High(UInt64)+1 mod L without using 128 bit arithmetic.
  27.       rem := (High(UInt64) mod L) + 1;
  28.       if rem = L then
  29.         rem := 0;
  30.       // Calculate the threshold
  31.       thr := (High(UInt64) - rem);
  32.       repeat
  33.         Result := ((UInt64(genrand_MT19937) shl 32) or UInt64(genrand_MT19937));
  34.       until Result <= thr;
  35.       Result := Result mod L;
  36.     end;
  37.   end;
  38.  

However this does NOT behave like the old implementation. This is especially relevant for former negative input. If negative input is desired, maybe this is more appropriate (Also the code is simpler):

Code: Pascal  [Select]
  1. function Random(L: Int32): Int32; overload;
  2. const
  3.   UpperBound: UInt32 = (High(UInt32) shr 1);
  4.  var
  5.    thr, rem: UInt32;
  6.  begin
  7.    if L <= 0 then
  8.      Result := 0
  9.    else
  10.    begin
  11.      rem := UInt32(UpperBound+1) mod UInt32(L);
  12.      thr := (UpperBound - rem);
  13.      repeat
  14.        Result := genrand_MT19937 and UpperBound;
  15.      until UInt32(Result) <= thr;
  16.      Result := Result mod L;
  17.    end;
  18.  end;
  19.  
  20. function Random(L: Int64): Int64; overload;
  21. const
  22.   UpperBound: UInt64 = (High(UInt64) shr 1);
  23.  var
  24.    thr, rem: UInt64;
  25.  begin
  26.    if L <= 0 then
  27.      Result := 0
  28.    else
  29.    begin
  30.      rem := (UpperBound +1) mod UInt64(L);
  31.      thr := (UpperBound - rem);
  32.      repeat
  33.        Result :=  ((UInt64(genrand_MT19937) shl 32) or UInt64(genrand_MT19937)) and UpperBound;
  34.      until UInt64(Result) <= thr;
  35.      Result := Result mod L;
  36.    end;
  37.  end;    
  38.  

However this still behaves differently for neagtive L in the 64 bit case. Maybe it should be clarified in the documentation what happens when L is smaller or euqal to zero.

Edit: I read the old code wrong. If it is negative it should return a negative number in the respective range. So this is still insufficient.
Edit: It is too hot. I can't think clearly and i do not trust myself. Currently random(Int32(-5)) seems to return numbers in the interval [-4; 0], but random(Int64(-5)) returns numbers in the interval [0, 4]??? Can someone clarify this please?
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: PascalDragon on July 26, 2019, 09:23:42 am
I don't think it is necessary to harden random against timing attacks. It is not designed to be used as cryptographic primitive.
Correct. That's not its purpose.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: devEric69 on July 26, 2019, 10:31:15 am
If it's a question of having a real random number generator (a long sequence of numbers, stored and studied in a spreadsheet program, to see if it's roughly (no computer generator is perfect: it's a question of numbers generated, before it becomes biased. So, the real question is: does the count, of random numbers that I will use, remain more or less uniformly distributed in the considered range of random numbers that interests me, with this generator?) formally distributed according to a bell curve, over a given range of numbers, centered, and according to a normal or other statistical law), then it's preferable to use a more advanced algorithm like Tore, or another one.

My 2 cents.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 27, 2019, 09:53:27 pm
it's preferable to use a more advanced algorithm like Tore
I googled this algorithm and apparently this is a way to construct low discrepancy sequences by taking the fractional part of multiplies of irrational numbers (e.g. the square roots of primes). Is this correct?

I also found [1] and [2], both giving an overview over different sampling methods. I went ahead and implemented them together with a simple check, wether their output is really unbiased (a.k.a. i did not implement a bug). In my opinion the one using the intrinsic (no division) and the one with only a single division are really interesting.

[1]  https://arxiv.org/pdf/1805.10941 (https://arxiv.org/pdf/1805.10941)
[2]  http://www.pcg-random.org/posts/bounded-rands.html  (http://www.pcg-random.org/posts/bounded-rands.html)

Edit: Removed source.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Thaddy on July 27, 2019, 10:30:08 pm
Did you test with at least BigCrush?
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 27, 2019, 11:48:48 pm
I checked TestU01, but i seems like it only accepts powers of 2 as the output range of your PRNG. So i was not able to run the samplers against it.

Edit: If the upper range of the interval is divides the upper bound of the generator, no bias for any sampler is expected.
Edit2: Unless, of course, the underlying PRNG is also biased.
Edit3: Actually, I can convert the output of the sampler to some floating point number and assume some precision. Then i can feed to the testsuit. Maybe i try this.

Edit4: Ok, i tested the samplers against SmallCrush. Find the details and results below. However i don't think i will run this against BigCrush, because completing this takes forever and i cannot run my PC overnight. Also i have read somwhere, that the Mersenne Twister fails some test in big crush? Would this mean it will also fail for the composite number, even when the rejecting maps are working correctly?

Results:

As expected both naive non rejecting methods fail several tests. To be precise:

Code: [Select]
========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        TMulSampler
 Number of statistics:  15
 Total CPU time:   00:00:41.60
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  3  Gap                              eps 
  4  SimpPoker                        eps 
  5  CouponCollector                  eps 
  7  WeightDistrib                    eps 
  9  HammingIndep                     eps 
 10  RandomWalk1 H                    eps 
 10  RandomWalk1 M                    eps 
 10  RandomWalk1 J                    eps 
 10  RandomWalk1 C                   2.5e-5
 ----------------------------------------------
 All other tests were passed


========= Summary results of SmallCrush =========

 Version:          TestU01 1.2.3
 Generator:        TModSampler
 Number of statistics:  15
 Total CPU time:   00:00:41.26
 The following tests gave p-values outside [0.001, 0.9990]:
 (eps  means a value < 1.0e-300):
 (eps1 means a value < 1.0e-15):

       Test                          p-value
 ----------------------------------------------
  2  Collision                        eps 
  6  MaxOft                           eps 
  6  MaxOft AD                      1 - eps1
  7  WeightDistrib                    eps 
  9  HammingIndep                   5.6e-16
 10  RandomWalk1 H                    eps 
 10  RandomWalk1 M                    eps 
 10  RandomWalk1 J                    eps 
 10  RandomWalk1 R                    eps 
 10  RandomWalk1 C                    eps 
 ----------------------------------------------
 All other tests were passed
TMulSampler refers to the multiply and shift method and TModSampler refers to the modulo maping, both (at least in a somewhat similar form) currently implemented in the RTL.

 All other implemented rejecting methods pass SmallCrush without failing any test. I attached the full output log to this post.

Used method:

To feed the output of the samplers into testsuite i construct a floating point number iterative (via Random(L)/(L-1) + Random(L)/(L-1)^2 + ...) until the number has absorbed at least 52 random bits. For the test i used a generator width of 31 bit (backed by the mersenne twister generator in the RTL) and target output interval size of L= 1610612736.

I attached the source code to this post (The code is a bit ugly, but well...). To build it you need fpc, gcc, testu01 installed and all header files and libraries in the respective search path. First call "gcc -c bat.c" and after this build the pascal application. If you do not want to run against testu01, remove the testu01 define in the pascal program and it should compile on any fpc installation. Since all samplers are tested against testu01 directly after each other it is helpful to pipe the output in some file and just watch it with tail -f or something.

Feel free to run it against big or small crush yourself.

I will also remove the attached code on the older forums post, to avoid having two different sources online.

Edit5: Actually it should be Random(L)/L + Random(L)/L^2 + ... . I will rerun the test with the correct formula ....
Edit6: Ok, it yields basically the same results. I updated  the attachments.
Title: Re: Output of Random(SomeInteger) appears to be biased
Post by: Pangea on July 31, 2019, 04:03:02 am
I have seen, that the issue has been marked as resolved. However i still see two minor problems with this:

- The generated output interval for negative input appears to be one to small? Or is this intended?
- The output is still biased.

What is the stance on this? Is the bias in this version of the function deemed acceptable (since it has the same bias the 32 bit version has)?

This is the current code in trunk:
Code: Pascal  [Select]
  1. function random(l:int64): int64;
  2. var
  3.  a, b, c, d: cardinal;
  4.  q, bd, ad, bc, ac: qword;
  5.  carry: qword;
  6. begin
  7.   if (l < 0) then
  8.     begin
  9.       inc(l);
  10.       q:=qword(-l)
  11.     end
  12.   else
  13.     q:=qword(l);
  14.  
  15.   a:=mtwist_u32rand;
  16.  b:=mtwist_u32rand;
  17.  
  18.  c:=q shr 32;
  19.  d:=cardinal(q);
  20.  
  21.  bd:=b*d;
  22.  ad:=a*d;
  23.  bc:=b*c;
  24.  ac:=a*c;
  25.  
  26.  // We only need the carry bit
  27.  carry:=((bd shr 32)+cardinal(ad)+cardinal(bc)) shr 32;
  28.  
  29.  // Calculate the final result
  30.  result:=int64(carry+(ad shr 32)+(bc shr 32)+ac);
  31.  if l<0 then
  32.    result:=-result;
  33. end;
  34.