I would like to generate n random numbers (e.g., n=200), where the range of possible values is between 2 and 40, with a mean of 12 and a median of 6.5.
I searche
OK, you're looking for a distribution with no fewer than 4 parameters: two defining the range and two responsible for the required mean and median.
I can think of two possibilities off the top of my head:
Truncated normal distribution, look here for details. You already have the range defined, and have to recover μ and σ from the mean and median. That requires solving a couple of nonlinear equations, but it is quite doable in Python. Sampling could be done using https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html
4-parameter Beta distribution, see here for details. Again, recovering α and β of the Beta distribution from the mean and median requires solving a couple of nonlinear equations. Once you know them, sampling is easy via https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.beta.html (rescaling the [0, 1] samples to the required range).
UPDATE
Here is how you could do it for the truncated normal, going from the mean to μ: Truncated normal with a given mean
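For the 4-parameter Beta option, the two equations might be solved numerically along the lines of the sketch below. It assumes the Beta distribution is stretched over [2, 40] with `loc=2, scale=38`, ties β to α through the mean, and then root-finds α so that the median matches. The helper name `fit_beta` and the search bracket [0.1, 5] for α are assumptions made for this example, not part of the answer above.

```python
from scipy.optimize import brentq
from scipy.stats import beta

LOW, HIGH = 2.0, 40.0
MEAN, MEDIAN = 12.0, 6.5

def fit_beta(low=LOW, high=HIGH, mean=MEAN, median=MEDIAN):
    """Return (a, b) of a Beta on [low, high] with the given mean and median."""
    scale = high - low
    m = (mean - low) / scale      # target mean on the unit interval
    q = (median - low) / scale    # target median on the unit interval

    def b_from_a(a):
        # The mean of Beta(a, b) is a / (a + b), so b follows from a.
        return a * (1.0 - m) / m

    def median_gap(a):
        # Zero when the median of Beta(a, b_from_a(a)) hits the target.
        return beta.median(a, b_from_a(a)) - q

    # Assumed bracket: the gap changes sign between a = 0.1 and a = 5.
    a = brentq(median_gap, 0.1, 5.0)
    return a, b_from_a(a)

a, b = fit_beta()
dist = beta(a, b, loc=LOW, scale=HIGH - LOW)
sample = dist.rvs(size=200)   # continuous values in [2, 40]
print(a, b, dist.mean(), dist.median())
```

The distribution itself has the requested mean and median; a finite sample of 200 draws will only match them approximately.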
If you have a bunch of smaller arrays with the right median and mean, you can combine them to produce a larger array.
So... you can pre-generate smaller arrays as you are currently doing, and then combine them randomly for larger n. Of course, this will result in a biased random sample, but it sounds like you just want something that's approximately random.
Here's working (py3) code that generates a sample of size 5000 with your desired properties, which it builds from smaller samples of size 4, 6, 8, 10, ..., 18.
Note that I changed how the smaller random samples are built: half of the numbers must be <= 6 and half >= 7 if the median is to be 6.5, so we generate those halves independently. This speeds things up massively.
import collections
import numpy as np
import random

rs = collections.defaultdict(list)
for i in range(50):
    n = random.randrange(4, 20, 2)
    while True:
        x = np.append(np.random.randint(2, 7, size=n//2), np.random.randint(7, 41, size=n//2))
        if x.mean() == 12 and np.median(x) == 6.5:
            break
    rs[len(x)].append(x)

def random_range(n):
    if n % 2:
        raise AssertionError("%d must be even" % n)
    r = []
    while n:
        i = random.randrange(4, min(20, n+1), 2)
        # Don't be left with only 2 slots left.
        if n - i == 2: continue
        xs = random.choice(rs[i])
        r.extend(xs)
        n -= i
    random.shuffle(r)
    return r

xs = np.array(random_range(5000))
print([(i, list(xs).count(i)) for i in range(2, 41)])
print(len(xs))
print(xs.mean())
print(np.median(xs))
Output:
[(2, 620), (3, 525), (4, 440), (5, 512), (6, 403), (7, 345), (8, 126), (9, 111), (10, 78), (11, 25), (12, 48), (13, 61), (14, 117), (15, 61), (16, 62), (17, 116), (18, 49), (19, 73), (20, 88), (21, 48), (22, 68), (23, 46), (24, 75), (25, 77), (26, 49), (27, 83), (28, 61), (29, 28), (30, 59), (31, 73), (32, 51), (33, 113), (34, 72), (35, 33), (36, 51), (37, 44), (38, 25), (39, 38), (40, 46)]
5000
12.0
6.5
The first line of the output shows that there are 620 2's, 525 3's, 440 4's etc. in the final array.
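Why combining the small samples preserves both statistics can be seen on a tiny example. The two chunks below are hand-picked for illustration (they are not produced by the code above); each has mean 12 and median 6.5, with half of its values <= 6 and half >= 7.

```python
import numpy as np

# Hand-picked chunks (an assumption for illustration), each with mean 12 and median 6.5
a = np.array([2, 6, 7, 33])
b = np.array([2, 3, 6, 7, 14, 40])

combined = np.concatenate((a, b))

for name, arr in (("a", a), ("b", b), ("combined", combined)):
    print(name, arr.mean(), np.median(arr))
```

The mean survives because sums and counts simply add up. The median survives because every chunk contributes equally many values to the lower and the upper half, and each chunk contains both a 6 and a 7, so the two middle values of the combined array are still 6 and 7.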
One way to get a result really close to what you want is to generate two separate random arrays of length 100 that satisfy your median constraint and cover the desired range of numbers. Concatenating the arrays gives a mean that is around 12 but not exactly 12. Since only the mean is off, you can get the expected result by tweaking one of these arrays.
In [162]: arr1 = np.random.randint(2, 7, 100)
In [163]: arr2 = np.random.randint(7, 40, 100)
In [164]: np.mean(np.concatenate((arr1, arr2)))
Out[164]: 12.22
In [166]: np.median(np.concatenate((arr1, arr2)))
Out[166]: 6.5
The following is a vectorized solution that constrains how the random sequence is created, so it avoids for loops and other Python-level code and should be much faster than solutions that rely on them:
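import numpy as np
import math

def gen_random():
    arr1 = np.random.randint(2, 7, 99)   # lower half: 2..6
    arr2 = np.random.randint(7, 40, 99)  # upper half: 7..39 (randint's high bound is exclusive)
    mid = [6, 7]                         # pins the median at 6.5
    # surplus over the target sum 12 * 200, spread over 40 values
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    decm, intg = math.modf(i)            # fractional and integer parts
    args = np.argsort(arr2)
    arr2[args[-41:-1]] -= int(intg)      # 40 of the largest values take the integer part
    arr2[args[-1]] -= int(np.round(decm * 40))  # the largest value takes the remainder
    return np.concatenate((arr1, mid, arr2))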
Demo:
arr = gen_random()
print(np.median(arr))
print(arr.mean())
6.5
12.0
The logic behind the function:
To get a random array meeting those criteria, we can concatenate three arrays: arr1, mid and arr2. arr1 and arr2 each hold 99 items, and mid holds the 2 items 6 and 7, so that the final result has a median of 6.5. Now we can create two random arrays, each of length 99. All we need to do to give the result a mean of 12 is to find the difference between the current sum and 12 * 200, and subtract it from our N largest numbers, which in this case we can take from arr2, using N=40.
Edit:
If it's not a problem to have float numbers in your result, you can actually shorten the function as follows:
import numpy as np

def gen_random():
    # np.float was removed from recent NumPy releases; the builtin float is equivalent
    arr1 = np.random.randint(2, 7, 99).astype(float)
    arr2 = np.random.randint(7, 40, 99).astype(float)
    mid = [6, 7]
    i = ((np.sum(arr1 + arr2) + 13) - (12 * 200)) / 40
    args = np.argsort(arr2)
    arr2[args[-40:]] -= i
    return np.concatenate((arr1, mid, arr2))
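Because the correction is subtracted from the largest values without any bounds check, it may be worth validating a generated array before using it. The helper below is a minimal sketch; the name `check_sample` and its default targets are assumptions for this example, not part of the answer.

```python
import numpy as np

def check_sample(arr, n=200, low=2, high=40, mean=12.0, median=6.5):
    """Return True if arr meets the size, range, mean and median targets."""
    arr = np.asarray(arr, dtype=float)
    if arr.size != n:
        return False
    return bool(
        arr.min() >= low
        and arr.max() <= high
        and np.isclose(arr.mean(), mean)
        and np.isclose(np.median(arr), median)
    )

# Tiny hand-made examples (not generated by gen_random)
print(check_sample([2, 6, 7, 33], n=4))
print(check_sample([2, 6, 7, 41], n=4))
```

With the function above in scope, `check_sample(gen_random())` would perform the same check on a full 200-element result.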
While this post already has an accepted answer, I'd like to contribute a general non-integer approach. It does not need loops or testing. The idea is to take a PDF with compact support. Building on the idea of the accepted answer by Kasrâmvd, make two distributions, one on the left and one on the right interval. Choose the shape parameters such that the mean falls on the given value. The interesting opportunity here is that one can create a continuous PDF, i.e. one without a jump where the intervals join.
As an example I have chosen the beta distribution. To have finite non-zero values at the border I've chosen beta = 1 for the left and alpha = 1 for the right. Looking at the definition of the PDF, continuity and the requirement on the mean give two equations:
4.5 / alpha = 33.5 / beta
2 + 4.5 * alpha / ( alpha + 1 ) + 6.5 + 33.5 * 1 / ( 1 + beta ) = 24
This is a quadratic equation that is rather easy to solve. Then just use scipy.stats.beta like
from scipy.stats import beta
import matplotlib.pyplot as plt
import numpy as np
x1 = np.linspace(2, 6.5, 200 )
x2 = np.linspace(6.5, 40, 200 )
# i use s and t not alpha and beta
s = 1./737 *(np.sqrt(294118) - 418 )
t = 1./99 *(np.sqrt(294118) - 418 )
data1 = beta.rvs(s, 1, loc=2, scale=4.5, size=20000)
data2 = beta.rvs(1, t, loc=6.5, scale=33.5, size=20000)
data = np.concatenate( ( data1, data2 ) )
print( np.mean( data1 ), 2 + 4.5 * s/(1.+s) )
print( np.mean( data2 ), 6.5 + 33.5/(1.+t) )
print( np.mean( data ) )
print( np.median( data ) )
fig = plt.figure()
ax = fig.add_subplot( 1, 1, 1 )
ax.hist(data1, bins=13, density=True )
ax.hist(data2, bins=67, density=True )
ax.plot( x1, beta.pdf( x1, s, 1, loc=2, scale=4.5 ) )
ax.plot( x2, beta.pdf( x2, 1, t, loc=6.5, scale=33.5 ) )
ax.set_yscale( 'log' )
plt.show()
provides
>> 2.661366939244768 2.6495436216856976
>> 21.297348804473618 21.3504563783143
>> 11.979357871859191
>> 6.5006779033245135
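so the results are as required. (The original post showed the resulting plot here: log-scale histograms of both samples with the two PDFs overlaid.)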
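The closed-form values of s and t used in the script can be recovered from the two conditions. The sketch below takes the left-hand mean as 2 + 4.5 * alpha / (alpha + 1), as in the script's first print line; the variable names are my own.

```python
import math

# Continuity at x = 6.5:  alpha / 4.5 = beta / 33.5   =>   beta = (67 / 9) * alpha
# Mean condition:         4.5 * alpha / (alpha + 1) + 33.5 / (1 + beta) = 24 - 2 - 6.5
# Substituting beta and clearing denominators gives
#     737 * alpha**2 + 836 * alpha - 162 = 0
qa, qb, qc = 737.0, 836.0, -162.0
alpha = (-qb + math.sqrt(qb * qb - 4 * qa * qc)) / (2 * qa)  # positive root
beta_ = 67.0 / 9.0 * alpha

mean_left = 2 + 4.5 * alpha / (alpha + 1)
mean_right = 6.5 + 33.5 / (1 + beta_)

print(alpha, beta_)
print(mean_left, mean_right, mean_left + mean_right)
```

The positive root simplifies to (sqrt(294118) - 418) / 737, which is the expression assigned to s, and beta_ to (sqrt(294118) - 418) / 99, the expression assigned to t.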
Here, you want a median value lower than the mean value. That means that a uniform distribution is not appropriate: you want many small values and fewer large ones.
Specifically, you want as many values less than or equal to 6 as values greater than or equal to 7.
A simple way to ensure that the median will be 6.5 is to have the same number of values in the range [ 2 - 6 ] as in [ 7 - 40 ]. If you chose uniform distributions in both ranges, you would have a theoretical mean of 13.75, which is not that far from the required 12.
A slight variation on the weights can make the theoretical mean even closer: if we use [ 5, 4, 3, 2, 1, 1, ..., 1 ] for the relative weights of the random.choices of the [ 7, 8, ..., 40 ] range, we find a theoretical mean of 19.98 for that range, which is close enough to the expected 20.
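The theoretical means quoted above can be checked with a few lines of plain Python. This is only a verification sketch; the variable names for the means are my own.

```python
pop1 = list(range(2, 7))         # lower half: 2..6, uniform
pop2 = list(range(7, 41))        # upper half: 7..40
w2 = [5, 4, 3, 2] + [1] * 30     # relative weights for pop2

mean_low = sum(pop1) / len(pop1)
mean_high_uniform = sum(pop2) / len(pop2)
mean_high_weighted = sum(w * v for w, v in zip(w2, pop2)) / sum(w2)

print((mean_low + mean_high_uniform) / 2)     # 13.75
print(round(mean_high_weighted, 2))           # 19.98
print((mean_low + mean_high_weighted) / 2)    # overall mean with the weights
```

The overall mean is the plain average of the two half-means because both halves contain the same number of values.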
Example code:
>>> import random, statistics
>>> pop1 = list(range(2, 7))
>>> pop2 = list(range(7, 41))
>>> w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)
>>> r1 = random.choices(pop1, k=2500)
>>> r2 = random.choices(pop2, w2, k=2500)
>>> r = r1 + r2
>>> random.shuffle(r)
>>> statistics.mean(r)
12.0358
>>> statistics.median(r)
6.5
>>>
So we now have a distribution of 5000 values that has a median of exactly 6.5 and a mean of 12.0358 (this one is random, and another run will give a slightly different value). If we want an exact mean of 12, we just have to tweak some values. Here sum(r) is 60179 when it should be 60000, so we have to decrease 179 values that are neither 2 (which would go out of range) nor 7 (which would change the median).
In the end, a possible generator function could be:
import random

def gendistrib(n):
    if n % 2 != 0:
        raise ValueError("gendistrib needs an even parameter")
    n2 = n//2                             # n / 2 in Python 2
    pop1 = list(range(2, 7))              # lower range
    pop2 = list(range(7, 41))             # upper range
    w2 = [ 5, 4, 3, 2 ] + ( [1] * 30)     # weights for upper range
    r1 = random.choices(pop1, k=n2)       # lower part of the distrib.
    r2 = random.choices(pop2, w2, k=n2)   # upper part
    r = r1 + r2
    random.shuffle(r)                     # randomize order
    # time to force an exact mean
    tot = sum(r)
    expected = 12 * n
    if tot > expected:                    # too high: decrease some values
        for i, val in enumerate(r):
            if val != 2 and val != 7:
                r[i] = val - 1
                tot -= 1
                if tot == expected:
                    random.shuffle(r)     # shuffle again the decreased values
                    break
    elif tot < expected:                  # too low: increase some values
        for i, val in enumerate(r):
            if val != 6 and val != 40:
                r[i] = val + 1
                tot += 1
                if tot == expected:
                    random.shuffle(r)     # shuffle again the increased values
                    break
    return r
It is really fast: I could timeit gendistrib(10000) at less than 0.02 seconds. But it should not be used for small distributions (fewer than 1000 values).