Generate random numbers with a given (numerical) distribution

我寻月下人不归 2020-11-22 11:18

I have a file with some probabilities for different values e.g.:

1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2

I would like to generate random numbers using this distribution. Does an existing module that handles this exist?

13 Answers
  • 2020-11-22 11:37

    An advantage of generating the list using the CDF is that you can use binary search. While you need O(n) time and space for preprocessing, you can then get k numbers in O(k log n). Since plain Python lists are inefficient, you can use the array module. (A bisect-based sketch of the binary-search variant follows the code below.)

    If you insist on constant space, you can do the following; O(n) time, O(1) space.

    import random

    def random_distr(l):
        r = random.uniform(0, 1)
        s = 0
        for item, prob in l:
            s += prob
            if s >= r:
                return item
        return item  # Might occur because of floating point inaccuracies
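
    For the binary-search variant mentioned above, here is a minimal sketch (not from the original answer) that precomputes the CDF and then samples with the standard bisect module; build_cdf and sample are just illustrative names:

    import bisect
    import random

    def build_cdf(pairs):
        # O(n) preprocessing: parallel lists of items and cumulative probabilities
        items, cdf = [], []
        total = 0.0
        for item, prob in pairs:
            total += prob
            items.append(item)
            cdf.append(total)
        return items, cdf

    def sample(items, cdf):
        # each draw is O(log n): binary search for where r falls in the CDF
        r = random.uniform(0, cdf[-1])
        return items[bisect.bisect_left(cdf, r)]

    Drawing k numbers is then k calls to sample(), giving the O(k log n) total mentioned above.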
    
  • 2020-11-22 11:37

    None of these answers is particularly clear or simple.

    Here is a clear, simple method that is guaranteed to work.

    accumulate_normalize_values takes a dictionary p that maps symbols to probabilities OR frequencies. It outputs a usable list of tuples from which to do selection.

    def accumulate_normalize_values(p):
        pi = p.items() if isinstance(p, dict) else p
        accum_pi = []
        accum = 0
        for item, value in pi:
            # pair each symbol with the running (cumulative) total so far
            accum_pi.append((item, value + accum))
            accum += value
        if accum == 0:
            raise Exception("You are about to explode the universe. Continue ? Y/N ")
        # divide by the grand total so the last end-point becomes 1.0
        return [(item, end * 1.0 / accum) for item, end in accum_pi]
    

    Yields:

    >>> accumulate_normalize_values( { 'a': 100, 'b' : 300, 'c' : 400, 'd' : 200  } )
    [('a', 0.1), ('c', 0.5), ('b', 0.8), ('d', 1.0)]
    

    Why it works

    The accumulation step turns each symbol into an interval that runs from the previous symbol's cumulative probability or frequency (or 0 in the case of the first symbol) up to its own. These intervals can be used to select from (and thus sample the provided distribution) by simply stepping through the list until the random number in the interval 0.0 -> 1.0 (prepared earlier) is less than or equal to the current symbol's interval end-point.

    The normalization frees us from having to make sure everything already sums to some particular value. After normalization, the "vector" of probabilities sums to 1.0.

    The rest of the code, for selection and for generating an arbitrarily long sample from the distribution, is below:

    def select(symbol_intervals, r):
        # walk the intervals until the random number falls at or below an end-point
        i = 0
        while r > symbol_intervals[i][1]:
            i += 1
            if i >= len(symbol_intervals):
                raise Exception("What did you DO to that poor list?")
        return symbol_intervals[i][0]


    def gen_random(alphabet, length, probabilities=None):
        from random import random
        from itertools import repeat
        if probabilities is None:
            probabilities = dict(zip(alphabet, repeat(1.0)))
        elif len(probabilities) > 0 and isinstance(probabilities[0], (int, float)):
            probabilities = dict(zip(alphabet, probabilities))  # ordered
        usable_probabilities = accumulate_normalize_values(probabilities)
        gen = []
        while len(gen) < length:
            gen.append(select(usable_probabilities, random()))
        return gen
    

    Usage :

    >>> gen_random (['a','b','c','d'],10,[100,300,400,200])
    ['d', 'b', 'b', 'a', 'c', 'c', 'b', 'c', 'c', 'c']   #<--- some of the time
    
  • 2020-11-22 11:41

    (OK, I know you are asking for shrink-wrap, but maybe those home-grown solutions just weren't succinct enough for your liking. :-)

    import random

    pdf = [(1, 0.1), (2, 0.05), (3, 0.05), (4, 0.2), (5, 0.4), (6, 0.2)]
    cdf = [(i, sum(p for j, p in pdf if j < i)) for i, _ in pdf]
    R = max(i for r in [random.random()] for i, c in cdf if c <= r)
    

    I pseudo-confirmed that this works by eyeballing the output of this expression:

    sorted(max(i for r in [random.random()] for i,c in cdf if c <= r)
           for _ in range(1000))
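
    To check the frequencies a bit more directly, one could tally a large sample with collections.Counter, reusing cdf from above (this check is not part of the original answer):

    from collections import Counter
    import random

    counts = Counter(max(i for r in [random.random()] for i, c in cdf if c <= r)
                     for _ in range(100000))
    print(counts)  # relative counts should be close to 0.1, 0.05, 0.05, 0.2, 0.4, 0.2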
    
  • 2020-11-22 11:45

    Make a list of items, based on their weights:

    items = [1, 2, 3, 4, 5, 6]
    probabilities = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
    # if the list of probs is normalized (sum(probs) == 1), omit this part
    prob = sum(probabilities)  # find sum of probs, to normalize them
    c = 1.0 / prob             # a multiplier to make a list of normalized probs
    probabilities = [c * x for x in probabilities]
    print(probabilities)

    # find the longest decimal expansion so every weight becomes a whole count
    ml = max(probabilities, key=lambda x: len(str(x)) - str(x).find('.'))
    ml = len(str(ml)) - str(ml).find('.') - 1
    amounts = [int(x * (10 ** ml)) for x in probabilities]
    itemsList = list()
    for i in range(0, len(items)):  # iterate through original items
        itemsList += items[i:i + 1] * amounts[i]

    # choose from itemsList randomly, e.g. with random.choice(itemsList)
    print(itemsList)
    

    An optimization may be to normalize amounts by the greatest common divisor, to make the target list smaller.
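
    A minimal sketch of that optimization, reusing items and amounts from the code above and then drawing one value with random.choice (this part is illustrative, not from the original answer):

    import random
    from functools import reduce
    from math import gcd

    g = reduce(gcd, amounts)             # greatest common divisor of all counts
    amounts = [a // g for a in amounts]  # shrink every count by the same factor
    itemsList = [item for item, n in zip(items, amounts) for _ in range(n)]
    print(random.choice(itemsList))      # one draw according to the weights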

    Also, this might be interesting.

  • 2020-11-22 11:46

    scipy.stats.rv_discrete might be what you want. You can supply your values and their probabilities via the values parameter, and then use the rvs() method of the resulting distribution object to generate random numbers.
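
    For example, a short sketch for the distribution in the question (custm is just an illustrative name):

    import numpy as np
    from scipy import stats

    xk = np.arange(1, 7)                   # the values 1..6
    pk = (0.1, 0.05, 0.05, 0.2, 0.4, 0.2)  # their probabilities
    custm = stats.rv_discrete(name='custm', values=(xk, pk))
    print(custm.rvs(size=10))              # ten samples from the distribution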

    As pointed out by Eugene Pakhomov in the comments, you can also pass a p keyword parameter to numpy.random.choice(), e.g.

    import numpy
    numpy.random.choice(numpy.arange(1, 7), p=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2])
    

    If you are using Python 3.6 or above, you can use random.choices() from the standard library – see the answer by Mark Dickinson.
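
    With the weights from the question, that looks something like this:

    import random
    random.choices([1, 2, 3, 4, 5, 6], weights=[0.1, 0.05, 0.05, 0.2, 0.4, 0.2], k=10)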

  • 2020-11-22 11:48

    Another answer, probably faster :)

    import bisect
    import random

    distribution = [(1, 0.2), (2, 0.3), (3, 0.5)]

    # init distribution: pair each value with its cumulative chance
    dlist = []
    sumchance = 0
    for value, chance in distribution:
        sumchance += chance
        dlist.append((value, sumchance))
    assert abs(sumchance - 1.0) < 1e-9  # a plain == 1.0 check is unreliable because of float rounding

    def draw(dlist):  # wrapped in a function so the return statements are valid
        # get random value
        r = random.random()
        if len(dlist) < 64:  # for small distributions use linear search (exact cut-over not measured)
            for value, sumchance in dlist:
                if r < sumchance:
                    return value
        # for larger distributions (or if the linear scan falls through on a float edge case),
        # binary search on the cumulative chances; the original answer left this branch unimplemented
        i = bisect.bisect_right([s for _, s in dlist], r)
        return dlist[min(i, len(dlist) - 1)][0]
    