Given a list of tuples where each tuple consists of a probability and an item, I'd like to sample an item according to its probability. For example, given the list [ (.3, 'a'),
There are hacks you can do if, for example, your probabilities fit nicely into percentages, etc.
For example, if you're fine with percentages, you can build a list of 100 elements and pick from it uniformly (at the cost of a high memory overhead); that version is shown further down.
But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution, after constructing it. This is equivalent to subdividing the unit interval [0,1] into 3 line segments labelled 'a', 'b', and 'c', then picking a random point on the unit interval and seeing which line segment it falls in.
#!/usr/bin/python3
import random
from collections import Counter  # only needed for the doctest below

def randomCategory(probDict):
    """
    >>> dist = {'a':.1, 'b':.2, 'c':.3, 'd':.4}
    >>> [randomCategory(dist) for _ in range(5)]
    ['c', 'c', 'a', 'd', 'c']
    >>> Counter(randomCategory(dist) for _ in range(10**5))
    Counter({'d': 40127, 'c': 29975, 'b': 19873, 'a': 10025})
    """
    r = random.random()  # range: [0,1)
    total = 0            # running cumulative probability, range: [0,1]
    for value,prob in probDict.items():
        total += prob
        if total>r:
            return value
    raise Exception('distribution not normalized: {probs}'.format(probs=probDict))
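One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not (a zero probability leaves total unchanged, so the total>r test cannot newly succeed on that item), but just in case, one could insert "if prob==0: continue" at the top of the loop body.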
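If you need many samples from the same distribution, it may be worth building the cumulative distribution once and then binary-searching it on each draw. The following is only a sketch of that idea, not part of the answer above: the name make_cdf_sampler is made up for illustration, and it assumes the input is a list of (probability, item) tuples as in the question. Zero-probability items are dropped up front, and the random point is scaled by the total so that slightly unnormalized inputs still work.

```python
import bisect
import itertools
import random

def make_cdf_sampler(pairs):
    """Return a function that samples items from (probability, item) pairs.

    Hypothetical helper for illustration; assumes at least one
    probability is strictly positive.
    """
    # Drop zero-probability items so they can never be returned.
    pairs = [(p, item) for p, item in pairs if p > 0]
    if not pairs:
        raise ValueError('need at least one item with positive probability')
    probs, items = zip(*pairs)
    # Right endpoints of each item's segment on the interval [0, total].
    cumulative = list(itertools.accumulate(probs))
    total = cumulative[-1]

    def sample():
        r = random.random() * total          # random point in [0, total]
        i = bisect.bisect_right(cumulative, r)  # segment containing r
        # Clamp in case floating-point rounding pushes r up to total.
        return items[min(i, len(items) - 1)]

    return sample

sampler = make_cdf_sampler([(.3, 'a'), (.4, 'b'), (.3, 'c')])
print([sampler() for _ in range(5)])
```

Each draw then costs a binary search over the segments instead of a linear scan, which only matters once the number of categories gets large.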
For the record, here's the hackish way to do it:
import random

def makeSampler(probDict):
    """
    >>> sampler = makeSampler({'a':0.3, 'b':0.4, 'c':0.3})
    >>> sampler()
    'a'
    >>> sampler()
    'c'
    """
    # one list entry per percentage point; prob*100 is a float, so round it to an int first
    oneHundredElements = sum(([val]*int(round(prob*100)) for val,prob in probDict.items()), [])
    def sampler():
        return random.choice(oneHundredElements)
    return sampler
However if you don't have resolution issues... this is actually probably the fastest way possible. =)
Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:
numpy.random.choice(['a','b','c'], size = 20, p = [0.3,0.4,0.3])
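If pulling in numpy is not an option, the standard library's random.choices (Python 3.6 and later) takes weights in much the same way. A minimal sketch, reusing the items and probabilities from the line above:

```python
import random

items = ['a', 'b', 'c']
weights = [0.3, 0.4, 0.3]  # relative weights; they do not have to sum to 1

# Draw 20 samples with replacement, each item chosen in proportion to its weight.
samples = random.choices(items, weights=weights, k=20)
print(samples)
```

Note that random.choices always samples with replacement and returns a plain list rather than a numpy array.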
I reckon the multinomial function is still a fairly easy way to get samples of a distribution in random order. This is just one way:
import numpy

def getSamples(inputs, size):
    probabilities, items = zip(*inputs)
    # how many times each item is drawn out of `size` samples
    sampleCounts = numpy.random.multinomial(size, probabilities)
    samples = numpy.array(tuple(countsToSamples(sampleCounts, items)))
    numpy.random.shuffle(samples)  # shuffle in place to randomize the order
    return samples

def countsToSamples(counts, items):
    # Python 3: the built-in zip and range replace izip and xrange
    for value, repeats in zip(items, counts):
        for _i in range(repeats):
            yield value
Here inputs is the list as specified in the question, e.g. [(.3, 'a'), (.4, 'b'), (.3, 'c')], and size is the number of samples you need.