Why do people say there is modulo bias when using a random number generator?

一整个雨季 2020-11-21 05:48

I have seen this question asked a lot but never seen a true concrete answer to it. So I am going to post one here which will hopefully help people understand why exactly there is modulo bias when using a random number generator.

10 Answers
  •  遇见更好的自我
    2020-11-21 06:36

    Definition

    Modulo bias is the inherent bias introduced when modulo arithmetic is used to reduce a set of possible values to a smaller subset. In general, a bias exists whenever the mapping between the input and output sets is not evenly distributed, as happens with modulo arithmetic whenever the size of the output set is not a divisor of the size of the input set.

    This bias is particularly hard to avoid in computing, where numbers are represented as strings of bits: 0s and 1s. Finding a truly random source of bits is also extremely difficult, but that is beyond the scope of this discussion. For the remainder of this answer, assume that there exists an unlimited source of truly random bits.

    Problem Example

    Let's consider simulating a die roll (0 to 5) using these random bits. There are 6 possibilities, so we need enough bits to represent 6 distinct values, which is 3 bits. Unfortunately, 3 random bits yield 8 possible outcomes:

    000 = 0, 001 = 1, 010 = 2, 011 = 3
    100 = 4, 101 = 5, 110 = 6, 111 = 7
    

    We can reduce the size of the outcome set to exactly 6 by taking the value modulo 6; however, this introduces the modulo bias problem: 110 yields a 0, and 111 yields a 1. This die is loaded.
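
    To see the bias concretely, here is a minimal sketch that enumerates all 8 three-bit values, reduces each modulo 6, and tallies how often each face appears:

    #include <array>
    #include <iostream>

    int main()
    {
        // Reduce every possible 3-bit value (0..7) modulo 6 and count
        // how often each face 0..5 is produced.
        std::array<int, 6> counts{};
        for (int bits = 0; bits < 8; ++bits) {
            ++counts[bits % 6];
        }
        for (int face = 0; face < 6; ++face) {
            std::cout << "face " << face << ": " << counts[face] << "/8\n";
        }
        // Faces 0 and 1 each appear twice; faces 2 through 5 appear once.
    }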

    Potential Solutions

    Approach 0:

    Rather than rely on random bits, in theory one could hire a small army to roll dice all day and record the results in a database, and then use each result only once. This is about as practical as it sounds, and more than likely would not yield truly random results anyway (pun intended).

    Approach 1:

    Instead of using the modulus, a naive but mathematically correct solution is to discard results that yield 110 and 111 and simply try again with 3 new bits. Unfortunately, this means that there is a 25% chance on each roll that a re-roll will be required, including each of the re-rolls themselves. This is clearly impractical for all but the most trivial of uses.
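
    A minimal sketch of this approach, assuming a hypothetical randomBit() helper standing in for the source of truly random bits (backed by std::random_device here only so the sketch compiles and runs):

    #include <iostream>
    #include <random>

    // Hypothetical stand-in for a source of truly random bits.
    int randomBit()
    {
        static std::random_device rd;
        return rd() & 1u;
    }

    // Approach 1: assemble 3 random bits; if the value is 6 or 7 (25% of the
    // time), discard it and try again with 3 fresh bits.
    int rollDie()
    {
        for (;;) {
            int value = (randomBit() << 2) | (randomBit() << 1) | randomBit();
            if (value < 6) {
                return value;  // 0..5, each with probability 1/6
            }
        }
    }

    int main()
    {
        for (int i = 0; i < 10; ++i) {
            std::cout << rollDie() << ' ';
        }
        std::cout << '\n';
    }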

    Approach 2:

    Use more bits: instead of 3 bits, use 4. This yields 16 possible outcomes. Of course, re-rolling any time the result is greater than 5 makes things worse (10/16 = 62.5%), so that alone won't help.

    Notice that 2 * 6 = 12 < 16, so we can safely take any outcome less than 12 and reduce that modulo 6 to evenly distribute the outcomes. The other 4 outcomes must be discarded, and then re-rolled as in the previous approach.

    Sounds good at first, but let's check the math:

    4 discarded results / 16 possibilities = 25%
    

    In this case, 1 extra bit didn't help at all!

    That result is unfortunate, but let's try again with 5 bits:

    32 % 6 = 2 discarded results; and
    2 discarded results / 32 possibilities = 6.25%
    

    A definite improvement, but not good enough in many practical cases. The good news is, adding more bits will never increase the chances of needing to discard and re-roll. This holds not just for dice, but in all cases.

    As demonstrated, however, adding 1 extra bit may not change anything. In fact, if we increase our roll to 6 bits, the probability remains 6.25%.
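
    The pattern is easy to tabulate. Here is a small sketch that prints the discard probability (2^k mod 6) / 2^k for k = 3 through 10 bits:

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // With k random bits there are 2^k outcomes, of which (2^k mod 6)
        // must be discarded to keep a 6-sided die fair.
        for (int k = 3; k <= 10; ++k) {
            std::uint64_t outcomes = std::uint64_t{1} << k;
            std::uint64_t discarded = outcomes % 6;
            std::cout << k << " bits: " << discarded << "/" << outcomes
                      << " = " << 100.0 * discarded / outcomes << "%\n";
        }
        // Prints 25% for 3 and 4 bits, 6.25% for 5 and 6 bits, 1.5625% for
        // 7 and 8 bits, and so on: the probability halves only every 2nd bit.
    }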

    This raises 2 additional questions:

    1. If we add enough bits, is there a guarantee that the probability of a discard will diminish?
    2. How many bits are enough in the general case?

    General Solution

    Thankfully the answer to the first question is yes. The problem with 6 is that 2^x mod 6 flips between 2 and 4, which happen to differ from each other by a factor of 2, so that for an odd x > 1,

    [2^x mod 6] / 2^x == [2^(x+1) mod 6] / 2^(x+1)
    

    Thus 6 is an exception rather than the rule. It is possible to find larger moduli for which consecutive powers of 2 behave the same way, but eventually this pattern must break, and the probability of a discard will shrink as more bits are added.

    Without offering further proof, in general using double the number of bits required will provide a smaller, usually insignificant, chance of a discard.
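
    As a quick sanity check of that rule of thumb: for a modulus n that needs b bits, drawing 2b bits bounds the discard probability by (2^(2b) mod n) / 2^(2b) < n / 2^(2b) <= 1 / 2^b. The sketch below verifies this numerically for a few arbitrarily chosen moduli:

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // For each modulus n, compute b = the number of bits needed to
        // represent n values, then the exact discard probability when
        // drawing 2*b random bits.
        for (std::uint64_t n : {6, 10, 100, 1000, 36573}) {
            int b = 0;
            while ((std::uint64_t{1} << b) < n) ++b;   // smallest b with 2^b >= n
            std::uint64_t pool = std::uint64_t{1} << (2 * b);
            std::uint64_t discarded = pool % n;
            std::cout << "n = " << n << ", bits = " << 2 * b
                      << ", discard probability = "
                      << static_cast<double>(discarded) / pool << '\n';
        }
        // Each printed probability is at most n / 2^(2b) <= 1 / 2^b.
    }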

    Proof of Concept

    Here is an example program that uses OpenSSL's libcrypto to supply random bytes. When compiling, be sure to link to the library with -lcrypto, which most people should have available.

    #include <cassert>
    #include <cstdint>
    #include <iostream>
    #include <limits>
    #include <openssl/rand.h>
    
    // Volatile sink so the compiler cannot optimize the rolls away.
    volatile uint32_t dummy;
    // Number of 64-bit draws that had to be thrown away and redrawn.
    uint64_t discardCount;
    
    uint32_t uniformRandomUint32(uint32_t upperBound)
    {
        assert(RAND_status() == 1);
    
        // Number of values at the top of the 64-bit range that must be
        // discarded so the accepted values form an exact multiple of upperBound.
        uint64_t discard = (std::numeric_limits<uint64_t>::max() % upperBound + 1) % upperBound;
    
        uint64_t randomPool;
        RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
    
        while(randomPool > (std::numeric_limits<uint64_t>::max() - discard)) {
            RAND_bytes((uint8_t*)(&randomPool), sizeof(randomPool));
            ++discardCount;
        }
    
        return randomPool % upperBound;
    }
    
    int main() {
        discardCount = 0;
    
        const uint32_t MODULUS = (1ul << 31)-1;
        const uint32_t ROLLS = 10000000;
    
        for(uint32_t i = 0; i < ROLLS; ++i) {
            dummy = uniformRandomUint32(MODULUS);
        }
        std::cout << "Discard count = " << discardCount << std::endl;
    }
    

    I encourage playing with the MODULUS and ROLLS values to see how many re-rolls actually happen under most conditions. A sceptical person may also wish to save the computed values to a file and verify that the distribution is uniform.
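
    As one way to do that check, here is a sketch of an alternative main() for the program above (a hypothetical variation, not part of the original) that rolls a 6-sided die and prints how often each face comes up; the counts should all be close to ROLLS / 6:

    int main() {
        discardCount = 0;
    
        // Use a small modulus so the per-face counts are easy to inspect.
        const uint32_t MODULUS = 6;
        const uint32_t ROLLS = 10000000;
        uint64_t counts[6] = {0};
    
        for(uint32_t i = 0; i < ROLLS; ++i) {
            ++counts[uniformRandomUint32(MODULUS)];
        }
        for(uint32_t face = 0; face < MODULUS; ++face) {
            std::cout << "face " << face << ": " << counts[face] << std::endl;
        }
        std::cout << "Discard count = " << discardCount << std::endl;
    }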
