问题
I have written the following function
bool random_bool(double probability)
{
double p_scaled = probability * (RAND_MAX+1) - rand();
if ( p_scaled >= 1 ) return true;
if ( p_scaled <= 0 ) return false;
return random_bool( p_scaled );
}
Given, that rand()
generates a number from uniform distribution on {0,1,...,RAND_MAX-1,RAND_MAX}
and numbers from subsequent calls can be treated as independent for all practical purposes except cryptography, this should return true
with probability p
: two if
statements return true
with probability slightly below p
, and false
with the probability slightly above 1-p
, while the recursive call deals with everything else.
However the following test fails:
long long N = 10000000000; //1e10
double p = 10000.0 / N;
int counter = 0;
for (long long i=0;i<N;i++) if (random_bool(p)) counter++;
assert(9672 < counter && counter <= 10330);
The assert statement is designed to fail only in 0.1% of cases. However it fails all the time (with counter
being between 10600 and 10700).
What's wrong?
P.S.: I've seen this question, but it doesn't help...
回答1:
One common defect in random number generators is a slight bias towards smaller results (basically a slight bias towards 0 in high order bits). This often happens when wrapping the RNG internal state to the output range is done using a simple mod, which is biased against high values unless RAND_MAX is a divisor of the size of the internal state. Here's a typical biased mapping implementation:
static unsigned int state;
int rand() {
state = nextState(); /* this actually moves the state from one random value to the next, eg., using a LCG */
return state % RAND_MAX; /* biased */
}
The bias occurs because lower values output an have one more mapping under mod from the state. E.g., if the state can have values 0-9 (10 values), and RAND_MAX is 3 (so values 0-2), then the % 3
operation results in, depending on the state
Output State
0 0 3 6 9
1 1 4 7
2 2 5 8
The result 0 is over-represented because it has a 4/10 chance of being selected, vs 3/10 for the other values.
As an example with more likely values, if the internal RNG state is a 16-integer, and RAND_MAX
is 35767 (as you mentioned it is on your platform), then all the values [0,6000] will be be output for 3 different state values, but the remaining ~30,000 values will only be output for 2 distinct state values - a significant bias. This kind of bias would tend to cause your counter value to be higher than expected (since smaller than uniform returns from rand() favors the p_scaled >= 1
condition.
It would help if you could post the exact implementation of rand() on your platform. If it turns out to be bias in the high bits, you may be able to eliminate this by passing the values you get from rand() through a good hash function, but a better approach is probably just to use a high quality source of random numbers, e.g., the Mersenne Twister . A better generator will also have a larger output range (effective, a higher RAND_MAX), which means your algorithm will suffer fewer retries/less recursion.
Even if the Visual Studio runtime implementation suffers from this defect, it is worth noting that it was probably at least partly an intentional design choice - using a RAND_MAX like 35767 that is relatively prime to the state size (typically a power of 2), ensures better randomness of the lower bits, since the % operation effectively mixes the high and low order bits - and having biased/non-random low order bits is often a bigger problem in practice than a slight bias in the high order bits because of the ubiquity of the caller of rand()
reducing the range using %, which effectively uses only the low order bits for moduli which are powers of 2 (also very common).
回答2:
I tried your code in Linux and the results were actually pretty decent. However, it seems like you are in Windows where RAND_MAX
is around 32768 probably. I'm saying since gcc complained in Linux that RAND_MAX+1
results in integer overflow so I had to add a cast.
So the problem is most likely that either RAND_MAX
is too small or the implementation of rand()
on your system is not very good.
If the source of the problem is the implementation of rand()
, your only option would be to change to another function from a better library. However, if the problem is the first one, you might be able to solve it as follows.
/* change `rand()` to return two concatenated rands */
typedef long long rand_type; /* this type depends on your actual system, you might get away with `int` */
#define BIGGER_RAND_MAX ((RAND_MAX + 2) * RAND_MAX)
rand_type bigger_rand(void)
{
return (rand_type)rand() * (RAND_MAX + 1) + rand();
}
And then try your program with this rand that has a higher range. If the problem persists, most likely it's your rand()
function that is far from random.
Side note: your random_bool
should return bool
, not double
! Since you are checking a double
against zero, that could also be the source of the problem, where you have false positives because the double could be not exactly zero.
回答3:
i think this function's result is relate to the RAND_MAX value, in this case p = 1e-6 , if RAND_MAX equal 9999 then this will never return true
来源:https://stackoverflow.com/questions/21362175/can-rand-in-c-be-used-to-generate-unbiased-bools