How to do a weighted random sample of categories in Python


Given a list of tuples where each tuple consists of a probability and an item, I'd like to sample an item according to its probability. For example, given the list [ (.3, 'a'),

9 answers
  • 2021-01-31 18:25

    There are hacks you can do if, for example, your probabilities fit nicely into percentages, etc.

    For example, if you're fine with percentages, a precomputed lookup list will work, at the cost of a high memory overhead (see the hackish version at the end of this answer).

    But the "real" way to do it with arbitrary float probabilities is to sample from the cumulative distribution, after constructing it. This is equivalent to subdividing the unit interval [0,1] into 3 line segments labelled 'a', 'b', and 'c', then picking a random point on the unit interval and seeing which line segment it lands in.

    #!/usr/bin/python3
    import random
    from collections import Counter  # only needed for the doctest below

    def randomCategory(probDict):
        """
            >>> dist = {'a':.1, 'b':.2, 'c':.3, 'd':.4}
    
            >>> [randomCategory(dist) for _ in range(5)]
            ['c', 'c', 'a', 'd', 'c']
    
            >>> Counter(randomCategory(dist) for _ in range(10**5))
            Counter({'d': 40127, 'c': 29975, 'b': 19873, 'a': 10025})
        """
        r = random.random() # range: [0,1)
        total = 0           # range: [0,1]
        for value,prob in probDict.items():
            total += prob   # running cumulative probability
            if total>r:     # r falls inside this value's segment
                return value
        # only reached if the probabilities add up to less than r
        raise Exception('distribution not normalized: {probs}'.format(probs=probDict))
    

    One has to be careful of methods which return values even if their probability is 0. Fortunately this method does not, but just in case, one could insert if prob==0: continue.
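
    The loop above rebuilds the running total on every call. If you draw many samples from the same distribution, one option is to build the cumulative distribution once and binary-search it. Here's a minimal sketch of that idea using the standard library's bisect and itertools.accumulate; the name make_cdf_sampler is made up for illustration, and scaling by the total is my own addition so that slightly unnormalized input doesn't raise.

```python
import bisect
import itertools
import random

def make_cdf_sampler(prob_dict):
    """Precompute the cumulative distribution once; sample by binary search."""
    values = list(prob_dict)
    # Running totals, one per value (the right edge of each line segment).
    cumulative = list(itertools.accumulate(prob_dict[v] for v in values))
    total = cumulative[-1]

    def sample():
        # Scale by the total so input that sums to slightly more or less
        # than 1 is still handled (an assumption, not in the original).
        r = random.random() * total
        # Index of the first cumulative entry strictly greater than r.
        index = bisect.bisect_right(cumulative, r)
        # Defensive clamp against floating-point edge cases.
        return values[min(index, len(values) - 1)]

    return sample

sampler = make_cdf_sampler({'a': .1, 'b': .2, 'c': .3, 'd': .4})
five_samples = [sampler() for _ in range(5)]
```

    Each draw is then O(log n) instead of O(n), and a value with probability 0 is still never returned, since its segment has zero width.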


    For the record, here's the hackish way to do it:

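    import random
    
    def makeSampler(probDict):
        """
            >>> sampler = makeSampler({'a':0.3, 'b':0.4, 'c':0.3})
            >>> sampler()
            'a'
            >>> sampler()
            'c'
        """
        # Build a 100-element lookup list; each value appears round(prob*100) times.
        # round() is needed because list repetition requires an int, and
        # float products such as 0.3*100 are not exact.
        oneHundredElements = sum(([val]*round(prob*100) for val,prob in probDict.items()), [])
        def sampler():
            return random.choice(oneHundredElements)
        return sampler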
    

    However if you don't have resolution issues... this is actually probably the fastest way possible. =)

  • 2021-01-31 18:31

    Since nobody used the numpy.random.choice function, here's one that will generate what you need in a single, compact line:

    import numpy
    numpy.random.choice(['a','b','c'], size = 20, p = [0.3,0.4,0.3])
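
    If you'd rather not depend on NumPy, the standard library's random.choices (Python 3.6+) also takes weights. This is a different function from the one above, shown only as a sketch; sample_items is a hypothetical helper name, and the (probability, item) tuples mirror the weights used in this answer.

```python
import random

def sample_items(pairs, k=1):
    """Draw k items, with replacement, from (probability, item) tuples."""
    weights, items = zip(*pairs)
    # random.choices treats the weights as relative, so they do not
    # strictly have to sum to 1.
    return random.choices(items, weights=weights, k=k)

samples = sample_items([(0.3, 'a'), (0.4, 'b'), (0.3, 'c')], k=20)
```

    The result is a plain list of 20 items rather than a NumPy array.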
    
  • 2021-01-31 18:33

    I reckon the multinomial function is still a fairly easy way to get samples of a distribution in random order. This is just one way:

    import numpy
    
    def getSamples(inputs, size):
        # inputs: list of (probability, item) tuples
        probabilities, items = zip(*inputs)
        # how many times each item occurs among `size` draws
        sampleCounts = numpy.random.multinomial(size, probabilities)
        samples = numpy.array(tuple(countsToSamples(sampleCounts, items)))
        # the expanded counts are grouped by item, so shuffle in place
        numpy.random.shuffle(samples)
        return samples
    
    def countsToSamples(counts, items):
        # Python 3: zip and range replace itertools.izip and xrange
        for value, repeats in zip(items, counts):
            for _i in range(repeats):
                yield value
    

    Here inputs is a list of (probability, item) tuples as in the question, e.g. [(.3, 'a'), (.4, 'b'), (.3, 'c')], and size is the number of samples you need.
