How to do weighted random sample of categories in python

Submitted by 孤者浪人 on 2019-12-02 17:32:08
import numpy

n = 1000
pairs = [(.3, 'a'), (.3, 'b'), (.4, 'c')]
probs, labels = zip(*pairs)
counts = numpy.random.multinomial(n, probs)  # one draw of n samples
result = list(zip(counts, labels))
# e.g. [(299, 'a'), (299, 'b'), (402, 'c')]
[count * label for count, label in result]
# strings of 299 'a's, 299 'b's and 402 'c's

How exactly would you like to receive the results?

sholte

This might do what you want:

numpy.array([.3,.4,.3]).cumsum().searchsorted(numpy.random.sample(5))
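The one-liner returns integer bin indices rather than categories; a small sketch mapping them back to the labels 'a', 'b', 'c':

```python
import numpy

# sholte's trick returns bin indices; index into a label array to get categories
labels = numpy.array(['a', 'b', 'c'])
cdf = numpy.array([.3, .4, .3]).cumsum()        # [0.3, 0.7, 1.0]
idx = cdf.searchsorted(numpy.random.sample(5))  # five indices in {0, 1, 2}
print(labels[idx])
```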

Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:

numpy.random.choice(['a','b','c'], size = 20, p = [0.3,0.4,0.3])
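On newer NumPy versions the Generator API is the recommended interface for the same call; a minimal seeded sketch (the seed value 42 is arbitrary):

```python
import numpy

rng = numpy.random.default_rng(seed=42)  # seeded only for reproducibility
draws = rng.choice(['a', 'b', 'c'], size=20, p=[0.3, 0.4, 0.3])
print(''.join(draws))
```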

There are hacks you can do if, for example, your probabilities fit nicely into percentages. If you're fine with percentages, a 100-element lookup list will work (at the cost of a high memory overhead); that version is given at the end of this answer.

But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution, after constructing it. This is equivalent to subdividing the unit interval [0,1] into 3 line segments labelled 'a', 'b', and 'c'; then picking a random point on the unit interval and seeing which line segment it falls in.

#!/usr/bin/python3
import random
from collections import Counter  # used in the doctest below

def randomCategory(probDict):
    """
        >>> dist = {'a':.1, 'b':.2, 'c':.3, 'd':.4}

        >>> [randomCategory(dist) for _ in range(5)]
        ['c', 'c', 'a', 'd', 'c']

        >>> Counter(randomCategory(dist) for _ in range(10**5))
        Counter({'d': 40127, 'c': 29975, 'b': 19873, 'a': 10025})
    """
    r = random.random()  # range: [0,1)
    total = 0            # range: [0,1]
    for value, prob in probDict.items():
        total += prob
        if total > r:
            return value
    raise Exception('distribution not normalized: {probs}'.format(probs=probDict))

One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not, but just in case, one could insert if prob==0: continue.


For the record, here's the hackish way to do it:

import random

def makeSampler(probDict):
    """
        >>> sampler = makeSampler({'a':0.3, 'b':0.4, 'c':0.3})
        >>> sampler()
        'a'
        >>> sampler()
        'c'
    """
    # build a 100-element list with each value repeated in proportion to its probability
    oneHundredElements = sum(([val]*int(round(prob*100)) for val, prob in probDict.items()), [])
    def sampler():
        return random.choice(oneHundredElements)
    return sampler

However if you don't have resolution issues... this is actually probably the fastest way possible. =)

How about creating 3 "a"s, 4 "b"s and 3 "c"s in a list and then just randomly selecting one? With enough iterations you will get the desired probability.
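Since Python 3.6 the standard library covers this idea directly: random.choices accepts integer weights, so the repeated list never has to be built. A brief sketch:

```python
import random

# weights=[3, 4, 3] is equivalent to probabilities [0.3, 0.4, 0.3]
draws = random.choices(['a', 'b', 'c'], weights=[3, 4, 3], k=10)
print(draws)
```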

Dunes

I reckon the multinomial function is still a fairly easy way to get samples of a distribution in random order. This is just one way:

import numpy

def getSamples(pairs, size):
    probabilities, items = zip(*pairs)
    sampleCounts = numpy.random.multinomial(size, probabilities)
    samples = numpy.array(tuple(countsToSamples(sampleCounts, items)))
    numpy.random.shuffle(samples)
    return samples

def countsToSamples(counts, items):
    for value, repeats in zip(items, counts):
        for _ in range(repeats):
            yield value

Where pairs is as specified, e.g. [(.3, 'a'), (.4, 'b'), (.3, 'c')], and size is the number of samples you need.

I'm not sure if this is the pythonic way of doing what you ask, but you could use random.sample(['a','a','a','b','b','b','b','c','c','c'], k) where k is the number of samples you want (note that random.sample draws without replacement, so the draws are not independent).

For a more robust method, partition the unit interval into sections based on the cumulative probability and draw from the uniform distribution [0, 1) using random.random(). In this case the subintervals would be (0, .3), (.3, .7) and (.7, 1). You choose the element based on which subinterval the draw falls into.
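That subinterval lookup maps directly onto the standard library's bisect module; a minimal sketch of the (0, .3), (.3, .7), (.7, 1) split described above:

```python
import bisect
import random

items = ['a', 'b', 'c']
cdf = [0.3, 0.7, 1.0]  # cumulative probabilities delimiting the subintervals

def weighted_pick():
    # bisect finds which subinterval the uniform draw falls into
    return items[bisect.bisect(cdf, random.random())]

print([weighted_pick() for _ in range(5)])
```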

Just inspired by sholte's very straightforward (and correct) answer: I'll just demonstrate how easy it is to extend it to handle arbitrary items (the session below assumes from numpy import * and from numpy.random import *), like:

In []: s= array([.3, .4, .3]).cumsum().searchsorted(sample(54))
In []: c, _= histogram(s, bins= arange(4))
In []: [item* c[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccccccc']

Update:
Based on the feedback of phant0m, it turns out that an even more straightforward solution can be implemented based on multinomial, like:

In []: s= multinomial(54, [.3, .4, .3])
In []: [item* s[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccc']

IMHO here we have a nice summary of empirical-cdf and multinomial based sampling yielding similar results. So, in summary, pick the one which suits your purposes best.

This may be of marginal benefit but I did it this way:

import numpy as np
import scipy.stats as sps

N = 1000
M3 = sps.multinomial.rvs(1, p=[0.3, 0.4, 0.3], size=N, random_state=None)
M3a = [np.where(r == 1)[0][0] for r in M3]  # convert one-hot encoding to integers

This is similar to @eat's answer.
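The one-hot-to-integer conversion can also be vectorised with argmax; a small sketch using numpy.random.multinomial directly, so it runs without SciPy:

```python
import numpy as np

N = 1000
M3 = np.random.multinomial(1, [0.3, 0.4, 0.3], size=N)  # N one-hot rows
M3a = M3.argmax(axis=1)  # position of the 1 in each row gives the category index
print(M3a[:10])
```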
