Given a function which produces a random integer in the range 1 to 5, write a function which produces a random integer in the range 1 to 7.
By using a rolling total, you can both
Both these problems are an issue with the simplistic rand(5)+rand(5)...
-type solutions. The following Python code shows how to implement it (most of this is proving the distribution).
import random
x = []
for i in range (0,7):
x.append (0)
t = 0
tt = 0
for i in range (0,700000):
########################################
##### qq.py #####
r = int (random.random () * 5)
t = (t + r) % 7
########################################
##### qq_notsogood.py #####
#r = 20
#while r > 6:
#r = int (random.random () * 5)
#r = r + int (random.random () * 5)
#t = r
########################################
x[t] = x[t] + 1
tt = tt + 1
high = x[0]
low = x[0]
for i in range (0,7):
print "%d: %7d %.5f" % (i, x[i], 100.0 * x[i] / tt)
if x[i] < low:
low = x[i]
if x[i] > high:
high = x[i]
diff = high - low
print "Variation = %d (%.5f%%)" % (diff, 100.0 * diff / tt)
And this output shows the results:
pax$ python qq.py
0: 99908 14.27257
1: 100029 14.28986
2: 100327 14.33243
3: 100395 14.34214
4: 99104 14.15771
5: 99829 14.26129
6: 100408 14.34400
Variation = 1304 (0.18629%)
pax$ python qq.py
0: 99547 14.22100
1: 100229 14.31843
2: 100078 14.29686
3: 99451 14.20729
4: 100284 14.32629
5: 100038 14.29114
6: 100373 14.33900
Variation = 922 (0.13171%)
pax$ python qq.py
0: 100481 14.35443
1: 99188 14.16971
2: 100284 14.32629
3: 100222 14.31743
4: 99960 14.28000
5: 99426 14.20371
6: 100439 14.34843
Variation = 1293 (0.18471%)
A simplistic rand(5)+rand(5)
, ignoring those cases where this returns more than 6 has a typical variation of 18%, 100 times that of the method shown above:
pax$ python qq_notsogood.py
0: 31756 4.53657
1: 63304 9.04343
2: 95507 13.64386
3: 127825 18.26071
4: 158851 22.69300
5: 127567 18.22386
6: 95190 13.59857
Variation = 127095 (18.15643%)
pax$ python qq_notsogood.py
0: 31792 4.54171
1: 63637 9.09100
2: 95641 13.66300
3: 127627 18.23243
4: 158751 22.67871
5: 126782 18.11171
6: 95770 13.68143
Variation = 126959 (18.13700%)
pax$ python qq_notsogood.py
0: 31955 4.56500
1: 63485 9.06929
2: 94849 13.54986
3: 127737 18.24814
4: 159687 22.81243
5: 127391 18.19871
6: 94896 13.55657
Variation = 127732 (18.24743%)
And, on the advice of Nixuz, I've cleaned the script up so you can just extract and use the rand7...
stuff:
import random
# rand5() returns 0 through 4 inclusive.
def rand5():
return int (random.random () * 5)
# rand7() generator returns 0 through 6 inclusive (using rand5()).
def rand7():
rand7ret = 0
while True:
rand7ret = (rand7ret + rand5()) % 7
yield rand7ret
# Number of test runs.
count = 700000
# Work out distribution.
distrib = [0,0,0,0,0,0,0]
rgen =rand7()
for i in range (0,count):
r = rgen.next()
distrib[r] = distrib[r] + 1
# Print distributions and calculate variation.
high = distrib[0]
low = distrib[0]
for i in range (0,7):
print "%d: %7d %.5f" % (i, distrib[i], 100.0 * distrib[i] / count)
if distrib[i] < low:
low = distrib[i]
if distrib[i] > high:
high = distrib[i]
diff = high - low
print "Variation = %d (%.5f%%)" % (diff, 100.0 * diff / count)
I think I have four answers, two giving exact solutions like that of @Adam Rosenfield but without the infinite loop problem, and other two with almost perfect solution but faster implementation than first one.
The best exact solution requires 7 calls to rand5
, but lets proceed in order to understand.
Strength of Adam's answer is that it gives a perfect uniform distribution, and there is very high probability (21/25) that only two calls to rand5() will be needed. However, worst case is infinite loop.
The first solution below also gives a perfect uniform distribution, but requires a total of 42 calls to rand5
. No infinite loops.
Here is an R implementation:
rand5 <- function() sample(1:5,1)
rand7 <- function() (sum(sapply(0:6, function(i) i + rand5() + rand5()*2 + rand5()*3 + rand5()*4 + rand5()*5 + rand5()*6)) %% 7) + 1
For people not familiar with R, here is a simplified version:
rand7 = function(){
r = 0
for(i in 0:6){
r = r + i + rand5() + rand5()*2 + rand5()*3 + rand5()*4 + rand5()*5 + rand5()*6
}
return r %% 7 + 1
}
The distribution of rand5
will be preserved. If we do the math, each of the 7 iterations of the loop has 5^6 possible combinations, thus total number of possible combinations are (7 * 5^6) %% 7 = 0
. Thus we can divide the random numbers generated in equal groups of 7. See method two for more discussion on this.
Here are all the possible combinations:
table(apply(expand.grid(c(outer(1:5,0:6,"+")),(1:5)*2,(1:5)*3,(1:5)*4,(1:5)*5,(1:5)*6),1,sum) %% 7 + 1)
1 2 3 4 5 6 7
15625 15625 15625 15625 15625 15625 15625
I think it's straight forward to show that Adam's method will run much much faster. The probability that there are 42 or more calls to rand5
in Adam's solution is very small ((4/25)^21 ~ 10^(-17)
).
Now the second method, which is almost uniform, but requires 6 calls to rand5
:
rand7 <- function() (sum(sapply(1:6,function(i) i*rand5())) %% 7) + 1
Here is a simplified version:
rand7 = function(){
r = 0
for(i in 1:6){
r = r + i*rand5()
}
return r %% 7 + 1
}
This is essentially one iteration of method 1. If we generate all possible combinations, here is resulting counts:
table(apply(expand.grid(1:5,(1:5)*2,(1:5)*3,(1:5)*4,(1:5)*5,(1:5)*6),1,sum) %% 7 + 1)
1 2 3 4 5 6 7
2233 2232 2232 2232 2232 2232 2232
One number will appear once more in 5^6 = 15625
trials.
Now, in Method 1, by adding 1 to 6, we move the number 2233 to each of the successive point. Thus the total number of combinations will match up. This works because 5^6 %% 7 = 1, and then we do 7 appropriate variations, so (7 * 5^6 %% 7 = 0).
If the argument of method 1 and 2 is understood, method 3 follows, and requires only 7 calls to rand5
. At this point, I feel this is the minimum number of calls needed for an exact solution.
Here is an R implementation:
rand5 <- function() sample(1:5,1)
rand7 <- function() (sum(sapply(1:7, function(i) i * rand5())) %% 7) + 1
For people not familiar with R, here is a simplified version:
rand7 = function(){
r = 0
for(i in 1:7){
r = r + i * rand5()
}
return r %% 7 + 1
}
The distribution of rand5
will be preserved. If we do the math, each of the 7 iterations of the loop has 5 possible outcomes, thus total number of possible combinations are (7 * 5) %% 7 = 0
. Thus we can divide the random numbers generated in equal groups of 7. See method one and two for more discussion on this.
Here are all the possible combinations:
table(apply(expand.grid(0:6,(1:5)),1,sum) %% 7 + 1)
1 2 3 4 5 6 7
5 5 5 5 5 5 5
I think it's straight forward to show that Adam's method will still run faster. The probability that there are 7 or more calls to rand5
in Adam's solution is still small ((4/25)^3 ~ 0.004
).
This is a minor variation of the the second method. It is almost uniform, but requires 7 calls to rand5
, that is one additional to method 2:
rand7 <- function() (rand5() + sum(sapply(1:6,function(i) i*rand5())) %% 7) + 1
Here is a simplified version:
rand7 = function(){
r = 0
for(i in 1:6){
r = r + i*rand5()
}
return (r+rand5()) %% 7 + 1
}
If we generate all possible combinations, here is resulting counts:
table(apply(expand.grid(1:5,(1:5)*2,(1:5)*3,(1:5)*4,(1:5)*5,(1:5)*6,1:5),1,sum) %% 7 + 1)
1 2 3 4 5 6 7
11160 11161 11161 11161 11161 11161 11160
Two numbers will appear once less in 5^7 = 78125
trials. For most purposes, I can live with that.
rand7() = (rand5()+rand5()+rand5()+rand5()+rand5()+rand5()+rand5())%7+1
Edit: That doesn't quite work. It's off by about 2 parts in 1000 (assuming a perfect rand5). The buckets get:
value Count Error%
1 11158 -0.0035
2 11144 -0.0214
3 11144 -0.0214
4 11158 -0.0035
5 11172 +0.0144
6 11177 +0.0208
7 11172 +0.0144
By switching to a sum of
n Error%
10 +/- 1e-3,
12 +/- 1e-4,
14 +/- 1e-5,
16 +/- 1e-6,
...
28 +/- 3e-11
seems to gain an order of magnitude for every 2 added
BTW: the table of errors above was not generated via sampling but by the following recurrence relation:
p[x,n]
is the number waysoutput=x
can happen givenn
calls torand5
.
p[1,1] ... p[5,1] = 1
p[6,1] ... p[7,1] = 0
p[1,n] = p[7,n-1] + p[6,n-1] + p[5,n-1] + p[4,n-1] + p[3,n-1]
p[2,n] = p[1,n-1] + p[7,n-1] + p[6,n-1] + p[5,n-1] + p[4,n-1]
p[3,n] = p[2,n-1] + p[1,n-1] + p[7,n-1] + p[6,n-1] + p[5,n-1]
p[4,n] = p[3,n-1] + p[2,n-1] + p[1,n-1] + p[7,n-1] + p[6,n-1]
p[5,n] = p[4,n-1] + p[3,n-1] + p[2,n-1] + p[1,n-1] + p[7,n-1]
p[6,n] = p[5,n-1] + p[4,n-1] + p[3,n-1] + p[2,n-1] + p[1,n-1]
p[7,n] = p[6,n-1] + p[5,n-1] + p[4,n-1] + p[3,n-1] + p[2,n-1]
(I have stolen Adam Rosenfeld's answer and made it run about 7% faster.)
Assume that rand5() returns one of {0,1,2,3,4} with equal distribution and the goal is return {0,1,2,3,4,5,6} with equal distribution.
int rand7() {
i = 5 * rand5() + rand5();
max = 25;
//i is uniform among {0 ... max-1}
while(i < max%7) {
//i is uniform among {0 ... (max%7 - 1)}
i *= 5;
i += rand5(); //i is uniform {0 ... (((max%7)*5) - 1)}
max %= 7;
max *= 5; //once again, i is uniform among {0 ... max-1}
}
return(i%7);
}
We're keeping track of the largest value that the loop can make in the variable max
. If the reult so far is between max%7 and max-1 then the result will be uniformly distrubuted in that range. If not, we use the remainder, which is random between 0 and max%7-1, and another call to rand() to make a new number and a new max. Then we start again.
Edit: Expect number of times to call rand5() is x in this equation:
x = 2 * 21/25
+ 3 * 4/25 * 14/20
+ 4 * 4/25 * 6/20 * 28/30
+ 5 * 4/25 * 6/20 * 2/30 * 7/10
+ 6 * 4/25 * 6/20 * 2/30 * 3/10 * 14/15
+ (6+x) * 4/25 * 6/20 * 2/30 * 3/10 * 1/15
x = about 2.21 calls to rand5()
This is equivalent to Adam Rosenfield's solution, but may be a bit more clear for some readers. It assumes rand5() is a function that returns a statistically random integer in the range 1 through 5 inclusive.
int rand7()
{
int vals[5][5] = {
{ 1, 2, 3, 4, 5 },
{ 6, 7, 1, 2, 3 },
{ 4, 5, 6, 7, 1 },
{ 2, 3, 4, 5, 6 },
{ 7, 0, 0, 0, 0 }
};
int result = 0;
while (result == 0)
{
int i = rand5();
int j = rand5();
result = vals[i-1][j-1];
}
return result;
}
How does it work? Think of it like this: imagine printing out this double-dimension array on paper, tacking it up to a dart board and randomly throwing darts at it. If you hit a non-zero value, it's a statistically random value between 1 and 7, since there are an equal number of non-zero values to choose from. If you hit a zero, just keep throwing the dart until you hit a non-zero. That's what this code is doing: the i and j indexes randomly select a location on the dart board, and if we don't get a good result, we keep throwing darts.
Like Adam said, this can run forever in the worst case, but statistically the worst case never happens. :)
There you go, uniform distribution and zero rand5 calls.
def rand7:
seed += 1
if seed >= 7:
seed = 0
yield seed
Need to set seed beforehand.