Why do people say there is modulo bias when using a random number generator?

2020-11-21 05:48

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here, which will hopefully help people understand why exactly there is "modulo bias" when using a random number generator.

10 Answers
  • 2020-11-21 06:28

    So rand() is a pseudo-random number generator which chooses an integer between 0 and RAND_MAX inclusive, where RAND_MAX is a constant defined in cstdlib (see this article for a general overview on rand()).

    Now what happens if you want to generate a random number between say 0 and 2? For the sake of explanation, let's say RAND_MAX is 10 and I decide to generate a random number between 0 and 2 by calling rand()%3. However, rand()%3 does not produce the numbers between 0 and 2 with equal probability!

    When rand() returns 0, 3, 6, or 9, rand()%3 == 0. Therefore, P(0) = 4/11

    When rand() returns 1, 4, 7, or 10, rand()%3 == 1. Therefore, P(1) = 4/11

    When rand() returns 2, 5, or 8, rand()%3 == 2. Therefore, P(2) = 3/11

    This does not generate the numbers between 0 and 2 with equal probability. Of course for small ranges this might not be the biggest issue but for a larger range this could skew the distribution, biasing the smaller numbers.
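
    As a quick check, here is a minimal sketch (using the hypothetical RAND_MAX of 10 from above, exposed here as TOY_RAND_MAX) that enumerates the 11 equally likely outputs and tallies the residue classes:

    #include <iostream>

    int main() {
        const int TOY_RAND_MAX = 10;  // the hypothetical small RAND_MAX used above
        const int n = 3;
        int count[n] = {};

        // Each output 0..TOY_RAND_MAX is equally likely; tally its residue class mod n.
        for (int r = 0; r <= TOY_RAND_MAX; ++r)
            ++count[r % n];

        for (int i = 0; i < n; ++i)
            std::cout << "P(" << i << ") = " << count[i] << "/" << (TOY_RAND_MAX + 1) << "\n";
    }

    This prints P(0) = 4/11, P(1) = 4/11 and P(2) = 3/11, matching the counts above.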

    So when does rand()%n return the numbers from 0 to n-1 with equal probability? When RAND_MAX%n == n - 1, i.e. when RAND_MAX + 1 is divisible by n. In that case, given our earlier assumption that rand() returns each number between 0 and RAND_MAX with equal probability, the residue classes modulo n are also equally distributed.

    So how do we solve this problem? A crude way is to keep generating random numbers until you get a number in your desired range:

    int x; 
    do {
        x = rand();
    } while (x >= n);
    

    but that's inefficient for low values of n, since you only have roughly an n/RAND_MAX chance of getting a value in your range, and so you'll need to perform roughly RAND_MAX/n calls to rand() on average.

    A more efficient approach is to take a large range whose length is divisible by n, such as [0, RAND_MAX - RAND_MAX % n), keep generating random numbers until you get one that lies in that range, and then take the modulus:

    int x;
    
    do {
        x = rand();
    } while (x >= (RAND_MAX - RAND_MAX % n));
    
    x %= n;
    

    For small values of n, this will rarely require more than one call to rand().
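
    As a rough illustration of that claim, here is a sketch (using whatever RAND_MAX your platform defines) that computes the expected number of rand() calls the loop above makes for a few values of n:

    #include <cstdlib>
    #include <iostream>

    int main() {
        const long ns[] = {2, 6, 10, 1000};

        for (long n : ns) {
            // Values below RAND_MAX - RAND_MAX % n are accepted, so the expected
            // number of calls is (RAND_MAX + 1) / (number of accepted values).
            double total    = static_cast<double>(RAND_MAX) + 1.0;
            double accepted = static_cast<double>(RAND_MAX) - RAND_MAX % n;
            std::cout << "n = " << n << ": expected calls = " << total / accepted << "\n";
        }
    }

    With a typical RAND_MAX of 2147483647, every one of these comes out only negligibly above 1.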


    Works cited and further reading:

    • CPlusPlus Reference

    • Eternally Confuzzled


  • 2020-11-21 06:30

    @user1413793 is correct about the problem. I'm not going to discuss that further, except to make one point: yes, for small values of n and large values of RAND_MAX, the modulo bias can be very small. But using a bias-inducing pattern means that you must consider the bias every time you calculate a random number and choose different patterns for different cases. And if you make the wrong choice, the bugs it introduces are subtle and almost impossible to unit test. Compared to just using the proper tool (such as arc4random_uniform), that's extra work, not less work. Doing more work and getting a worse solution is terrible engineering, especially when doing it right every time is easy on most platforms.

    Unfortunately, the implementations of the solution are all incorrect or less efficient than they should be. (Each solution has various comments explaining the problems, but none of the solutions have been fixed to address them.) This is likely to confuse the casual answer-seeker, so I'm providing a known-good implementation here.

    Again, the best solution is just to use arc4random_uniform on platforms that provide it, or a similar ranged solution for your platform (such as Random.nextInt in Java). It will do the right thing at no code cost to you. This is almost always the correct call to make.
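
    If you are on a platform without arc4random_uniform, one example of such a ranged tool (my suggestion, not something this answer names) is C++11's std::uniform_int_distribution, which is required to produce uniformly distributed results over the requested closed range:

    #include <iostream>
    #include <random>

    int main() {
        std::random_device rd;                          // seed source
        std::mt19937 gen(rd());                         // any decent engine works here
        std::uniform_int_distribution<int> dist(0, 2);  // inclusive range [0, 2], no modulo bias

        for (int i = 0; i < 10; ++i)
            std::cout << dist(gen) << ' ';
        std::cout << '\n';
    }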

    If you don't have arc4random_uniform, then you can use the power of open source to see exactly how it is implemented on top of a wider-range RNG (arc4random in this case, but a similar approach could also work on top of other RNGs).

    Here is the OpenBSD implementation:

    /*
     * Calculate a uniformly distributed random number less than upper_bound
     * avoiding "modulo bias".
     *
     * Uniformity is achieved by generating new random numbers until the one
     * returned is outside the range [0, 2**32 % upper_bound).  This
     * guarantees the selected random number will be inside
     * [2**32 % upper_bound, 2**32) which maps back to [0, upper_bound)
     * after reduction modulo upper_bound.
     */
    u_int32_t
    arc4random_uniform(u_int32_t upper_bound)
    {
        u_int32_t r, min;
    
        if (upper_bound < 2)
            return 0;
    
        /* 2**32 % x == (2**32 - x) % x */
        min = -upper_bound % upper_bound;
    
        /*
         * This could theoretically loop forever but each retry has
         * p > 0.5 (worst case, usually far better) of selecting a
         * number inside the range we need, so it should rarely need
         * to re-roll.
         */
        for (;;) {
            r = arc4random();
            if (r >= min)
                break;
        }
    
        return r % upper_bound;
    }
    

    It is worth noting the latest commit comment on this code for those who need to implement similar things:

    Change arc4random_uniform() to calculate 2**32 % upper_bound as -upper_bound % upper_bound. Simplifies the code and makes it the same on both ILP32 and LP64 architectures, and also slightly faster on LP64 architectures by using a 32-bit remainder instead of a 64-bit remainder.

    Pointed out by Jorden Verwer on tech@ ok deraadt; no objections from djm or otto
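
    To convince yourself that -upper_bound % upper_bound really equals 2**32 % upper_bound in 32-bit unsigned arithmetic, here is a small check of my own (not part of the OpenBSD source):

    #include <cassert>
    #include <cstdint>
    #include <iostream>

    int main() {
        const uint32_t bounds[] = {2, 3, 10, 1000, 4294967295u};

        for (uint32_t upper_bound : bounds) {
            // Unsigned negation wraps around: -upper_bound == 2^32 - upper_bound (mod 2^32),
            // and (2^32 - upper_bound) % upper_bound == 2^32 % upper_bound.
            uint32_t min32 = -upper_bound % upper_bound;
            uint64_t min64 = (uint64_t(1) << 32) % upper_bound;
            assert(min32 == min64);
            std::cout << upper_bound << " -> min = " << min32 << "\n";
        }
    }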

    The Java implementation is also easily findable (see previous link):

    public int nextInt(int n) {
        if (n <= 0)
            throw new IllegalArgumentException("n must be positive");

        if ((n & -n) == n)  // i.e., n is a power of 2
            return (int)((n * (long)next(31)) >> 31);

        int bits, val;
        do {
            bits = next(31);
            val = bits % n;
        } while (bits - val + (n-1) < 0);  // overflow here means bits fell in the incomplete block at the top of the range, so re-roll
        return val;
    }
    
  • 2020-11-21 06:33

    Repeatedly selecting a random number and rejecting those outside the desired range is a good way to remove the bias.

    Update

    We can make the code fast if we search for an x in a range whose length is divisible by n.

    // Assumptions
    // rand() in [0, RAND_MAX]
    // n in (0, RAND_MAX]
    
    int x; 
    
    // Keep searching for an x in a range divisible by n 
    do {
        x = rand();
    } while (x >= RAND_MAX - (RAND_MAX % n));
    
    x %= n;
    

    The above loop should be very fast, requiring only slightly more than 1 iteration on average.

  • 2020-11-21 06:34

    Modulo reduction is a commonly seen way to make a random integer generator avoid the worst case of running forever.

    However, there is no way to "fix" this worst case without introducing bias. It's not just modulo reduction (rand() % n, discussed in the accepted answer) that will introduce bias this way, but also the "multiply-and-shift" reduction of Daniel Lemire, or if you stop rejecting an outcome after a set number of iterations.

    Here is the reason why. Throughout, we will assume we have a "true" random generator that can produce unbiased and independent random bits.*

    In 1976, D. E. Knuth and A. C. Yao showed that any algorithm that produces random integers with a given probability, using only random bits, can be represented as a binary tree, where random bits indicate which way to traverse the tree and each leaf (endpoint) corresponds to an outcome. In this case, we're dealing with algorithms that generate random integers in [0, n), where each integer is chosen with probability 1/n. But if 1/n has a non-terminating binary expansion (which will be the case if n is not a power of 2), this binary tree will necessarily either—

    • have an "infinite" depth, or
    • include "rejection" leaves at the end of the tree,

    and in either case, the algorithm won't run in constant time and will run forever in the worst case. (On the other hand, when n is a power of 2, the optimal binary tree will have a finite depth and no rejection nodes.)

    The binary tree concept also shows that any way to "fix" this worst-case time complexity will lead to bias in general. For instance, modulo reductions are equivalent to a binary tree in which rejection leaves are replaced with labeled outcomes — but since there are more possible outcomes than rejection leaves, only some of the outcomes can take the place of the rejection leaves, introducing bias. The same kind of binary tree — and the same kind of bias — results if you stop rejecting after a set number of iterations. (However, this bias may be negligible depending on the application. There are also security aspects to random integer generation, which are too complicated to discuss in this answer.)

    To illustrate, the following JavaScript code implements a random integer algorithm called the Fast Dice Roller by J. Lumbroso (2013). Note that it includes a rejection event and a loop which are necessary to make the algorithm unbiased in the general case.

    function randomInt(minInclusive, maxExclusive) {
      var maxInclusive = (maxExclusive - minInclusive) - 1
      var x = 1
      var y = 0
      while(true) {
        x = x * 2
        var randomBit = (Math.random() < 0.5 ? 0 : 1)
        y = y * 2 + randomBit
        if(x > maxInclusive) {
          if (y <= maxInclusive) { return y + minInclusive }
          // Rejection
          x = x - maxInclusive - 1
          y = y - maxInclusive - 1
        }
      }
    }
    

    Note

    * This answer won't involve the rand() function in C because it has many issues. Perhaps the most serious here is the fact that the C standard does not specify a particular distribution for the numbers returned by rand().

  • 2020-11-21 06:36

    Definition

    Modulo bias is the inherent bias in using modulo arithmetic to reduce an input set to a smaller output set (a subset of the input set). In general, a bias exists whenever the mapping between the input and output set is not equally distributed, as is the case when using modulo arithmetic and the size of the output set is not a divisor of the size of the input set.

    This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding truly random sources of randomness is also extremely difficult, but is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.

    Problem Example

    Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent the number 6, which is 3 bits. Unfortunately, 3 random bits yields 8 possible outcomes:

    000 = 0, 001 = 1, 010 = 2, 011 = 3
    100 = 4, 101 = 5, 110 = 6, 111 = 7
    

    We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6, however this presents the modulo bias problem: 110 yields a 0, and 111 yields a 1. This die is loaded.
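
    A tiny sketch that makes the loading visible by enumerating all 8 equally likely 3-bit patterns and counting which face each one maps to:

    #include <iostream>

    int main() {
        int faces[6] = {};

        // All 8 three-bit patterns, reduced modulo 6.
        for (int bits = 0; bits < 8; ++bits)
            ++faces[bits % 6];

        for (int f = 0; f < 6; ++f)
            std::cout << "face " << f << ": " << faces[f] << "/8\n";
    }

    Faces 0 and 1 come up 2/8 of the time each, while faces 2 through 5 come up only 1/8 of the time each.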

    Potential Solutions

    Approach 0:

    Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).

    Approach 1:

    Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield 110 and 111 and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rolls themselves. This is clearly impractical for all but the most trivial of uses.

    Approach 2:

    Use more bits: instead of 3 bits, use 4. This yields 16 possible outcomes. Of course, re-rolling anytime the result is greater than 5 makes things worse (10/16 = 62.5%), so that alone won't help.

    Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.

    Sounds good at first, but let's check the math:

    4 discarded results / 16 possibilities = 25%
    

    In this case, 1 extra bit didn't help at all!

    That result is unfortunate, but let's try again with 5 bits:

    32 % 6 = 2 discarded results; and
    2 discarded results / 32 possibilities = 6.25%
    

    A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.

    As demonstrated, however, adding 1 extra bit may not change anything. In fact, if we increase our roll to 6 bits, the probability remains 6.25%.

    This raises 2 additional questions:

    1. If we add enough bits, is there a guarantee that the probability of a discard will diminish?
    2. How many bits are enough in the general case?

    General Solution

    Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4, which happen to differ by exactly a factor of 2, so that for an odd x > 1,

    [2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
    

    Thus 6 is an exception rather than the rule. It is possible to find larger moduli that behave the same way for consecutive powers of 2, but eventually this must wrap around, and the probability of a discard will be reduced.

    Without offering further proof, in general using double the number of bits required will provide a smaller, usually insignificant, chance of a discard.
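
    To see that claim in numbers, here is a small sketch that prints the discard probability (2^b mod 6) / 2^b for several bit counts b, reproducing the 25% and 6.25% figures above:

    #include <cstdint>
    #include <iostream>

    int main() {
        const uint64_t n = 6;  // faces on the die

        for (unsigned b = 3; b <= 12; ++b) {
            uint64_t total   = uint64_t(1) << b;  // equally likely outcomes of b random bits
            uint64_t discard = total % n;         // outcomes that force a re-roll
            std::cout << b << " bits: " << discard << "/" << total
                      << " = " << 100.0 * discard / total << "%\n";
        }
    }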

    Proof of Concept

    Here is an example program that uses OpenSSL's libcrypto to supply random bytes. When compiling, be sure to link to the library with -lcrypto, which most everyone should have available.

    #include <iostream>
    #include <assert.h>
    #include <cstdint>
    #include <limits>
    #include <openssl/rand.h>
    
    volatile uint32_t dummy;
    uint64_t discardCount;
    
    uint32_t uniformRandomUint32(uint32_t upperBound)
    {
        assert(RAND_status() == 1);
        assert(upperBound > 0);

        // Number of values to discard from the top of the 64-bit range so that
        // the count of accepted values is an exact multiple of upperBound.
        uint64_t discard = (std::numeric_limits<uint64_t>::max() % upperBound) + 1;
        if (discard == upperBound)
            discard = 0;

        uint64_t randomPool;
        RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));

        while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
            RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
            ++discardCount;
        }

        return randomPool % upperBound;
    }
    
    int main() {
        discardCount = 0;
    
        const uint32_t MODULUS = (1ul << 31)-1;
        const uint32_t ROLLS = 10000000;
    
        for(uint32_t i = 0; i < ROLLS; ++i) {
            dummy = uniformRandomUint32(MODULUS);
        }
        std::cout << "Discard count = " << discardCount << std::endl;
    }
    

    I encourage playing with the MODULUS and ROLLS values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to a file and verify the distribution is uniform.

  • 2020-11-21 06:36

    As the accepted answer indicates, "modulo bias" has its roots in the low value of RAND_MAX. It uses an extremely small value of RAND_MAX (10) to show that if RAND_MAX were 10 and you tried to generate a number between 0 and 2 using %, the following outcomes would result:

    rand() % 3   // if RAND_MAX were only 10, gives
    output of rand()   |   rand()%3
    0                  |   0
    1                  |   1
    2                  |   2
    3                  |   0
    4                  |   1
    5                  |   2
    6                  |   0
    7                  |   1
    8                  |   2
    9                  |   0
    

    So there are 4 outputs of 0's (4/10 chance) and only 3 outputs of 1 and 2 (3/10 chances each).

    So it's biased. The lower numbers have a better chance of coming out.

    But that only shows up so obviously when RAND_MAX is small. Or more specifically, when the number you are modding by is large compared to RAND_MAX.

    A much better solution than looping (which is insanely inefficient and shouldn't even be suggested) is to use a PRNG with a much larger output range. The Mersenne Twister algorithm has a maximum output of 4,294,967,295. As such, doing MersenneTwister::genrand_int32() % 10 will, for all intents and purposes, be equally distributed, and the modulo bias effect will all but disappear.
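
    For a sense of how small that residual bias is with a 32-bit generator and n = 10: 2^32 % 10 = 6, so six of the ten outcomes receive exactly one extra input value each, a difference of 1/2^32 per draw. Here is a rough sketch using std::mt19937 (the standard library's Mersenne Twister, standing in for the MersenneTwister class named above):

    #include <cstdint>
    #include <iostream>
    #include <random>

    int main() {
        std::mt19937 gen(12345);   // fixed seed so the demonstration is repeatable
        uint64_t counts[10] = {};

        for (int i = 0; i < 1000000; ++i)
            ++counts[gen() % 10];  // 32-bit output reduced modulo 10

        // Each count should land near 100000; the modulo bias is far smaller
        // than the sampling noise at this scale.
        for (int d = 0; d < 10; ++d)
            std::cout << d << ": " << counts[d] << "\n";
    }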
