Why do people say there is modulo bias when using a random number generator?

一整个雨季 2020-11-21 05:48

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly the

10 answers

悲哀的现实 (original poster)

2020-11-21 06:39
Mark's solution (the accepted solution) is nearly perfect.
```
int x;

do {
    x = rand();
} while (x >= (RAND_MAX - RAND_MAX % n));

x %= n;
```
edited Mar 25 '16 at 23:16 by Mark Amery

However, it has a caveat: it discards one valid set of outcomes in any scenario where RAND_MAX (RM) is 1 less than a multiple of N (where N is the number of possible valid outcomes).

That is, when the count of discarded values (I) equals N, the discarded values actually form a valid set rather than an invalid one.

The cause is that at some point Mark loses sight of the difference between N and RAND_MAX.

N is a set whose valid members are only positive integers, since it represents a count of the responses that would be valid (e.g. set N = {1, 2, 3, ... n}).

RAND_MAX, however, is a set which (as defined for our purposes) includes any number of non-negative integers.

In its most generic form, what is defined here as RAND_MAX is the set of all valid outcomes, which could theoretically include negative numbers or non-numeric values.

Therefore RAND_MAX is better thought of as the set of "possible responses".

However, N operates against the count of the values within the set of valid responses, so even as defined in our specific case, RAND_MAX is a value one less than the total number of values it covers: rand() returns one of the RAND_MAX + 1 values from 0 through RAND_MAX.

Using Mark's solution, values are discarded when: X >= RM - RM % N
```
Example:

RAND_MAX value (RM) = 255
Valid outcomes (N) = 4

When X >= 252, the discarded values for X are: 252, 253, 254, 255

So, if the random value selected (X) is in {252, 253, 254, 255}

Number of discarded values (I) = RM % N + 1 == N

 I.e.:

 I = RM % N + 1
 I = 255 % 4 + 1
 I = 3 + 1
 I = 4

   X >= ( RM - RM % N )
 255 >= (255 - 255 % 4)
 255 >= (255 - 3)
 255 >= (252)

 Discard returns true
```
As the example above shows, when the value of X (the random number we get from the initial function) is 252, 253, 254, or 255, we discard it even though these four values form a valid set of returned values.

That is, when the count of discarded values (I) equals N (the number of valid outcomes), the original function discards a valid set of return values.
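The count in the worked example can be checked by brute force. The following is a minimal sketch: the name `count_discarded_original` and the parameter `rm` (standing in for RAND_MAX so that small ranges can be tried) are illustrative assumptions, not part of Mark's answer.

```c
#include <assert.h>
#include <limits.h>

/* Counts the values x in [0, rm] that the original condition
 * x >= rm - rm % n throws away. rm is a stand-in for RAND_MAX
 * (an assumption made here so that small ranges can be tested). */
int count_discarded_original(int rm, int n)
{
    /* rm < INT_MAX keeps x++ in the loop below from overflowing. */
    assert(n > 0 && rm >= 0 && rm < INT_MAX);

    int discarded = 0;
    for (int x = 0; x <= rm; x++) {
        if (x >= rm - rm % n) {
            discarded++;
        }
    }
    return discarded;
}
```

Called with rm = 255 and n = 4 it counts four discarded values, even though 256 is already divisible by 4 and nothing needs to be rejected.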

If we describe the difference between the values N and RM as D, ie:
```
D = (RM - N)
```
Then, as the value of D becomes smaller, the percentage of unneeded re-rolls caused by this method increases at each N that divides RM + 1. (This is a valid concern whenever RAND_MAX + 1 is not a prime number, since it then has such divisors.)

EG:
```
RM=255, N=2    Then: D = 253, Lost percentage = 0.78125%
RM=255, N=4    Then: D = 251, Lost percentage = 1.5625%
RM=255, N=8    Then: D = 247, Lost percentage = 3.125%
RM=255, N=16   Then: D = 239, Lost percentage = 6.25%
RM=255, N=32   Then: D = 223, Lost percentage = 12.5%
RM=255, N=64   Then: D = 191, Lost percentage = 25%
RM=255, N=128  Then: D = 127, Lost percentage = 50%
```
Since the percentage of re-rolls needed increases the closer N comes to RM, this can be a valid concern at many different values, depending on the constraints of the system running the code and the values being looked for.
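The percentages in the table can be reproduced with a short calculation. This is only a sketch under the assumption that "lost" means draws rejected by the original condition beyond those that must be rejected for uniformity; `lost_percentage` and `rm` are names introduced here for illustration.

```c
#include <assert.h>

/* Percentage of draws that the original condition rejects without need.
 * rm is a stand-in for RAND_MAX (an assumption, to allow small examples). */
double lost_percentage(int rm, int n)
{
    assert(n > 0 && rm >= 0);

    long long total = (long long)rm + 1; /* number of values 0..rm */
    long long rejected = rm % n + 1;     /* rejected by x >= rm - rm % n */
    long long required = total % n;      /* rejections actually needed */

    return 100.0 * (double)(rejected - required) / (double)total;
}
```

The result is nonzero only when n divides rm + 1, which is exactly the case this answer is concerned with.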

To negate this, we can make a simple amendment, as shown here:
```
int x;

do {
    x = rand();
} while (x > (RAND_MAX - (((RAND_MAX % n) + 1) % n)));

x %= n;
```
This provides a more general version of the formula which accounts for the additional peculiarities of using modulus to define your max values.

Examples using a small value for RAND_MAX, where RAND_MAX + 1 is a multiple of N.

Mark's original version:
```
RAND_MAX = 3, n = 2, Values rand() can return = 0,1,2,3, Valid sets = 0,1 and 2,3.
When X >= (RAND_MAX - ( RAND_MAX % n ) )
When X >= 2 the value will be discarded, even though the set is valid.
```
Generalized Version 1:
```
RAND_MAX = 3, n = 2, Values rand() can return = 0,1,2,3, Valid sets = 0,1 and 2,3.
When X > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n ) )
When X > 3 the value would be discarded, but rand() never returns such a value, so there will be no discard.
```
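To see why the generalized condition always leaves a usable range, it may help to compute the largest accepted value directly. The helpers below are a sketch with hypothetical names; `rm` again stands in for RAND_MAX, and n is assumed to be at most rm + 1.

```c
#include <assert.h>

/* Largest x that the generalized condition still accepts.
 * rm is a stand-in for RAND_MAX (an assumption for illustration). */
int accept_limit_generalized(int rm, int n)
{
    /* n - 1 <= rm: there must be at least n values to choose from. */
    assert(n > 0 && rm >= 0 && n - 1 <= rm);
    return rm - ((rm % n + 1) % n);
}

/* Number of accepted values, i.e. the size of [0, limit]. */
long long accepted_count_generalized(int rm, int n)
{
    return (long long)accept_limit_generalized(rm, n) + 1;
}
```

The accepted count equals (rm + 1) minus (rm + 1) % n, so it is always a multiple of n, and nothing is rejected when n already divides rm + 1.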
Additionally, there is the case where N should equal the number of values rand() can return. Here you could set N = RAND_MAX + 1, unless RAND_MAX = INT_MAX, in which case RAND_MAX + 1 cannot be represented as an int.

Loop-wise, you could instead pass N = 1, for which any value of X is accepted, and put an if statement around the final modulus. But you may have code with a valid reason to call the function with n = 1 and get the normal result for it...

So it may be better to use n = 0, which would normally cause a division-by-zero error, to signal that you want n = RAND_MAX + 1.

Generalized Version 2:
```
int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - (((RAND_MAX % n) + 1) % n)));

    x %= n;
} else {
    // n == 0 stands for n = RAND_MAX + 1: every value is valid.
    x = rand();
}
```
Both of these solutions resolve the issue of needlessly discarded valid results, which occurs when RM + 1 is a multiple of n.

The second version also covers the edge case scenario when you need n to equal the total possible set of values contained in RAND_MAX.

The modified approach in both is the same and allows for a more general solution to the need of providing valid random numbers and minimizing discarded values.
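The uniformity claim can be checked exhaustively for small ranges. The sketch below feeds every possible draw through the generalized rule and compares the tallies; `outcomes_are_uniform`, `rm`, and `use_rejection` are names made up for this illustration, and plain `x % n` without rejection is included only for comparison.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Feeds every possible draw x in [0, rm] through the generalized rejection
 * rule and tallies x % n. Returns 1 if every outcome 0..n-1 is hit equally
 * often, 0 otherwise (or if allocation fails). With use_rejection == 0 it
 * tallies plain x % n over the whole range instead.
 * rm is a stand-in for RAND_MAX (an assumption for illustration). */
int outcomes_are_uniform(int rm, int n, int use_rejection)
{
    assert(n > 0 && rm >= 0 && rm < INT_MAX && n - 1 <= rm);

    long *tally = calloc((size_t)n, sizeof *tally);
    if (tally == NULL) {
        return 0;
    }

    /* Largest value that is kept; everything above it is re-rolled. */
    int limit = use_rejection ? rm - ((rm % n + 1) % n) : rm;
    for (int x = 0; x <= limit; x++) {
        tally[x % n]++;
    }

    int uniform = 1;
    for (int i = 1; i < n; i++) {
        if (tally[i] != tally[0]) {
            uniform = 0;
        }
    }
    free(tally);
    return uniform;
}
```

For rm = 255 and n = 3, the plain modulus is not uniform (outcome 0 occurs once more than the others), while the version with rejection is.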

To reiterate:

The Basic General Solution which extends Mark's example:
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

 int x;
 
 do {
     x = rand();
 } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );
 
 x %= n;
```
The Extended General Solution which Allows one additional scenario of RAND_MAX+1 = n:
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x;

if (n != 0) {
    do {
        x = rand();
    } while (x > (RAND_MAX - ( ( ( RAND_MAX % n ) + 1 ) % n) ) );

    x %= n;
} else {
    x = rand();
}
```
In some languages ( particularly interpreted languages ) doing the calculations of the compare-operation outside of the while condition may lead to faster results as this is a one-time calculation no matter how many re-tries are required. YMMV!
```
// Assumes:
//  RAND_MAX is a globally defined constant, returned from the environment.
//  int n; // User input, or externally defined, number of valid choices.

int x; // Resulting random number
int y; // One-time calculation of the compare value for x

if (n != 0) {
    // Computed once, and only when n != 0, so that % n cannot divide by zero.
    y = RAND_MAX - (((RAND_MAX % n) + 1) % n);

    do {
        x = rand();
    } while (x > y);

    x %= n;
} else {
    x = rand();
}
```
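Wrapped up as a function, the extended solution might look like the sketch below. The name `uniform_below` is hypothetical, and the function assumes that n is either 0 (meaning the full range, as described above) or at most RAND_MAX + 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns a value in [0, n) for n >= 1, or a raw rand() value when n == 0
 * (the convention used above for n = RAND_MAX + 1).
 * The name uniform_below is hypothetical. */
int uniform_below(int n)
{
    assert(n >= 0);
    if (n == 0) {
        return rand();
    }
    /* With more outcomes than rand() has values the loop could never end. */
    assert(n - 1 <= RAND_MAX);

    /* Compare value computed once, outside the retry loop. */
    const int limit = RAND_MAX - ((RAND_MAX % n + 1) % n);

    int x;
    do {
        x = rand();
    } while (x > limit);

    return x % n;
}
```

A caller would typically seed the generator once with `srand` and then call, for example, `uniform_below(6)` for a die roll in the range 0 to 5.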