Why do people say there is modulo bias when using a random number generator?

前端未结

关注

 10  1227

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly the

相关标签:

10条回答

遥遥无期

2020-11-21 06:36

I just wrote a code for Von Neumann's Unbiased Coin Flip Method, that should theoretically eliminate any bias in the random number generation process. More info can be found at (http://en.wikipedia.org/wiki/Fair_coin)

int unbiased_random_bit() {    
    int x1, x2, prev;
    prev = 2;
    x1 = rand() % 2;
    x2 = rand() % 2;

    for (;; x1 = rand() % 2, x2 = rand() % 2)
    {
        if (x1 ^ x2)      // 01 -> 1, or 10 -> 0.
        {
            return x2;        
        }
        else if (x1 & x2)
        {
            if (!prev)    // 0011
                return 1;
            else
                prev = 1; // 1111 -> continue, bias unresolved
        }
        else
        {
            if (prev == 1)// 1100
                return 0;
            else          // 0000 -> continue, bias unresolved
                prev = 0;
        }
    }
}

0 讨论(0)

悲哀的现实

2020-11-21 06:39
Mark's Solution (The accepted solution) is Nearly Perfect.
```
int x;

do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));

x %= n;
```
edited Mar 25 '16 at 23:16

Mark Amery 39k21170211
However, it has a caveat which discards 1 valid set of outcomes in any scenario where RAND_MAX (RM) is 1 less than a multiple of N (Where N = the Number of possible valid outcomes).

ie, When the 'count of values discarded' (D) is equal to N, then they are actually a valid set (V), not an invalid set (I).

What causes this is at some point Mark loses sight of the difference between N and Rand_Max.

N is a set who's valid members are comprised only of Positive Integers, as it contains a count of responses that would be valid. (eg: Set N = {1, 2, 3, ... n } )

Rand_max However is a set which ( as defined for our purposes ) includes any number of non-negative integers.

In it's most generic form, what is defined here as Rand Max is the Set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.

Therefore Rand_Max is better defined as the set of "Possible Responses".

However N operates against the count of the values within the set of valid responses, so even as defined in our specific case, Rand_Max will be a value one less than the total number it contains.

Using Mark's Solution, Values are Discarded when: X => RM - RM % N
```
EG: 

Ran Max Value (RM) = 255
Valid Outcome (N) = 4

When X => 252, Discarded values for X are: 252, 253, 254, 255

So, if Random Value Selected (X) = {252, 253, 254, 255}

Number of discarded Values (I) = RM % N + 1 == N

 IE:

 I = RM % N + 1
 I = 255 % 4 + 1
 I = 3 + 1
 I = 4

   X => ( RM - RM % N )
 255 => (255 - 255 % 4) 
 255 => (255 - 3)
 255 => (252)

 Discard Returns $True
```
As you can see in the example above, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255 we would discard it even though these four values comprise a valid set of returned values.

IE: When the count of the values Discarded (I) = N (The number of valid outcomes) then a Valid set of return values will be discarded by the original function.

If we describe the difference between the values N and RM as D, ie:
```
D = (RM - N)
```
Then as the value of D becomes smaller, the Percentage of unneeded re-rolls due to this method increases at each natural multiplicative. (When RAND_MAX is NOT equal to a Prime Number this is of valid concern)

EG:
```
RM=255 , N=2 Then: D = 253, Lost percentage = 0.78125%

RM=255 , N=4 Then: D = 251, Lost percentage = 1.5625%
RM=255 , N=8 Then: D = 247, Lost percentage = 3.125%
RM=255 , N=16 Then: D = 239, Lost percentage = 6.25%
RM=255 , N=32 Then: D = 223, Lost percentage = 12.5%
RM=255 , N=64 Then: D = 191, Lost percentage = 25%
RM=255 , N= 128 Then D = 127, Lost percentage = 50%
```
Since the percentage of Rerolls needed increases the closer N comes to RM, this can be of valid concern at many different values depending on the constraints of the system running he code and the values being looked for.

To negate this we can make a simple amendment As shown here:
```
 int x;
 
 do {
     x = rand();
 } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) );
 
 x %= n;
```
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.

Examples of using a small value for RAND_MAX which is a multiplicative of N.

Mark'original Version:
```
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
```
Generalized Version 1:
```
RAND_MAX = 3, n = 2, Values in RAND_MAX = 0,1,2,3, Valid Sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( RAND_MAX % n  ) + 1 ) % n )
When X > 3 the value would be discarded, but this is not a vlue in the set RAND_MAX so there will be no discard.
```
Additionally, in the case where N should be the number of values in RAND_MAX; in this case, you could set N = RAND_MAX +1, unless RAND_MAX = INT_MAX.

Loop-wise you could just use N = 1, and any value of X will be accepted, however, and put an IF statement in for your final multiplier. But perhaps you have code that may have a valid reason to return a 1 when the function is called with n = 1...

So it may be better to use 0, which would normally provide a Div 0 Error, when you wish to have n = RAND_MAX+1

Generalized Version 2:
```
int x;

if n != 0 {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) );

    x %= n;
} else {
    x = rand();
}
```
Both of these solutions resolve the issue with needlessly discarded valid results which will occur when RM+1 is a product of n.

The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.

The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.

To reiterate:

The Basic General Solution which extends mark's example:
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

 int x;
 
 do {
     x = rand();
 } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
 
 x %= n;
```
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x;

if n != 0 {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );

    x %= n;
} else {
    x = rand();
}
```
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x; // Resulting random number
int y; // One-time calculation of the compare value for x

y = RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) 

if n != 0 {
    do {
        x = rand();
    } while (x > y);

    x %= n;
} else {
    x = rand();
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
青春惊慌失措

2020-11-21 06:51
With a RAND_MAX value of 3 (in reality it should be much higher than that but the bias would still exist) it makes sense from these calculations that there is a bias:

1 % 2 = 1 2 % 2 = 0 3 % 2 = 1 random_between(1, 3) % 2 = more likely a 1

In this case, the % 2 is what you shouldn't do when you want a random number between 0 and 1. You could get a random number between 0 and 2 by doing % 3 though, because in this case: RAND_MAX is a multiple of 3.

Another method

There is much simpler but to add to other answers, here is my solution to get a random number between 0 and n - 1, so n different possibilities, without bias.
- the number of bits (not bytes) needed to encode the number of possibilities is the number of bits of random data you'll need
- encode the number from random bits
- if this number is >= n, restart (no modulo).
Really random data is not easy to obtain, so why use more bits than needed.

Below is an example in Smalltalk, using a cache of bits from a pseudo-random number generator. I'm no security expert so use at your own risk.
```
next: n

    | bitSize r from to |
    n < 0 ifTrue: [^0 - (self next: 0 - n)].
    n = 0 ifTrue: [^nil].
    n = 1 ifTrue: [^0].
    cache isNil ifTrue: [cache := OrderedCollection new].
    cache size < (self randmax highBit) ifTrue: [
        Security.DSSRandom default next asByteArray do: [ :byte |
            (1 to: 8) do: [ :i |    cache add: (byte bitAt: i)]
        ]
    ].
    r := 0.
    bitSize := n highBit.
    to := cache size.
    from := to - bitSize + 1.
    (from to: to) do: [ :i |
        r := r bitAt: i - from + 1 put: (cache at: i)
    ].
    cache removeFrom: from to: to.
    r >= n ifTrue: [^self next: n].
    ^r
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
南方客

2020-11-21 06:53
There are two usual complaints with the use of modulo.
- one is valid for all generators. It is easier to see in a limit case. If your generator has a RAND_MAX which is 2 (that isn't compliant with the C standard) and you want only 0 or 1 as value, using modulo will generate 0 twice as often (when the generator generates 0 and 2) as it will generate 1 (when the generator generates 1). Note that this is true as soon as you don't drop values, whatever the mapping you are using from the generator values to the wanted one, one will occurs twice as often as the other.
- some kind of generator have their less significant bits less random than the other, at least for some of their parameters, but sadly those parameter have other interesting characteristic (such has being able to have RAND_MAX one less than a power of 2). The problem is well known and for a long time library implementation probably avoid the problem (for instance the sample rand() implementation in the C standard use this kind of generator, but drop the 16 less significant bits), but some like to complain about that and you may have bad luck
Using something like
```
int alea(int n){ 
 assert (0 < n && n <= RAND_MAX); 
 int partSize = 
      n == RAND_MAX ? 1 : 1 + (RAND_MAX-n)/(n+1); 
 int maxUsefull = partSize * n + (partSize-1); 
 int draw; 
 do { 
   draw = rand(); 
 } while (draw > maxUsefull); 
 return draw/partSize; 
}
```
to generate a random number between 0 and n will avoid both problems (and it avoids overflow with RAND_MAX == INT_MAX)

BTW, C++11 introduced standard ways to the the reduction and other generator than rand().
0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2