Generate random numbers with a given (numerical) distribution

我寻月下人不归 2020-11-22 11:18

I have a file with some probabilities for different values e.g.:

1 0.1
2 0.05
3 0.05
4 0.2
5 0.4
6 0.2

I would like to generate random numb

13 Answers
  • 2020-11-22 11:51
    from __future__ import division
    import random
    from collections import Counter


    def num_gen(num_probs):
        # Use the smallest probability as the unit of weight. The distribution
        # is reproduced exactly only when every probability is an integer
        # multiple of the smallest one.
        min_prob = min(prob for num, prob in num_probs)
        lst = []
        for num, prob in num_probs:
            # Append num to lst a number of times proportional to its probability.
            # round() rather than a bare int() guards against float error (2.9999 -> 3).
            for _ in range(int(round(prob / min_prob))):
                lst.append(num)
        # Every element of lst now occurs in proportion to its probability.
        while True:
            # Pick a uniformly random element of lst.
            yield random.choice(lst)
    

    Verification:

    gen = num_gen([(1, 0.1),
                   (2, 0.05),
                   (3, 0.05),
                   (4, 0.2),
                   (5, 0.4),
                   (6, 0.2)])
    lst = []
    times = 10000
    for _ in range(times):
        lst.append(next(gen))
    # Verify the generated distribution:
    for item, count in sorted(Counter(lst).items()):
        print('%d has %f probability' % (item, count / times))
    
    1 has 0.099737 probability
    2 has 0.050022 probability
    3 has 0.049996 probability 
    4 has 0.200154 probability
    5 has 0.399791 probability
    6 has 0.200300 probability
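
    The list-based approach above matches the target distribution exactly only when every probability is an integer multiple of the smallest one. One possible way around that, sketched here under the assumption that the probabilities have short decimal representations, is to convert them to exact fractions and size the list by the least common denominator. The helper name build_pool is hypothetical and not part of the answer above.

```python
import random
from fractions import Fraction
from functools import reduce
from math import gcd


def build_pool(num_probs):
    """Return a list in which each number appears in exact proportion to its probability."""
    # Going through str() turns 0.05 into Fraction(1, 20) instead of the
    # long binary expansion of the float.
    fracs = [(num, Fraction(str(prob))) for num, prob in num_probs]
    # Least common multiple of all denominators.
    common = reduce(lambda a, b: a * b // gcd(a, b),
                    (f.denominator for _, f in fracs), 1)
    pool = []
    for num, f in fracs:
        # f * common is a whole number by construction.
        pool.extend([num] * int(f * common))
    return pool


pool = build_pool([(1, 0.1), (2, 0.05), (3, 0.05),
                   (4, 0.2), (5, 0.4), (6, 0.2)])
one_sample = random.choice(pool)
```

    For the probabilities in the question the pool has 20 entries. With probabilities that need many decimal places the pool can grow large, in which case a cumulative-sum approach is likely the better fit.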
    
  • 2020-11-22 11:54

    I wrote a solution for drawing random samples from a custom continuous distribution.

    I needed this for a use case similar to yours (generating random dates with a given probability distribution).

    You only need the function random_custDist and the line samples=random_custDist(x0,x1,custDist=custDist,size=1000). The rest is decoration ^^.

    import numpy as np
    
    #function
    def random_custDist(x0,x1,custDist,size=1, nControl=10**6):
        #generate a list of up to `size` random samples obeying the distribution custDist
        #rejection sampling: propose x uniformly in [x0,x1] and accept it with probability custDist(x)
        #custDist does not need to be normalized, but it must return values in [0,1]
        #best performance when max_{x in [x0,x1]} custDist(x) = 1
        #nControl caps the number of proposals, so fewer than `size` samples may be returned
        #(size defaults to 1 because comparing len(samples) with None fails on Python 3)
        samples=[]
        nLoop=0
        while len(samples)<size and nLoop<nControl:
            x=np.random.uniform(low=x0,high=x1)
            prop=custDist(x)
            assert prop>=0 and prop<=1
            if np.random.uniform(low=0,high=1) <=prop:
                samples += [x]
            nLoop+=1
        return samples
    
    #call
    x0=2007
    x1=2019
    def custDist(x):
        if x<2010:
            return .3
        else:
            return (np.exp(x-2008)-1)/(np.exp(2019-2007)-1)
    samples=random_custDist(x0,x1,custDist=custDist,size=1000)
    print(samples)
    
    #plot
    import matplotlib.pyplot as plt
    #hist
    bins=np.linspace(x0,x1,int(x1-x0+1))
    hist=np.histogram(samples, bins )[0]
    hist=hist/np.sum(hist)
    plt.bar( (bins[:-1]+bins[1:])/2, hist, width=.96, label='sample distribution')
    #dist
    grid=np.linspace(x0,x1,100)
    discCustDist=np.array([custDist(x) for x in grid]) #discrete version
    discCustDist*=1/(grid[1]-grid[0])/np.sum(discCustDist)
    plt.plot(grid,discCustDist,label='custom distribution (custDist)', color='C1', linewidth=4)
    #decoration
    plt.legend(loc=3,bbox_to_anchor=(1,0))
    plt.show()
    

    The performance of this solution could certainly be improved, but I prefer readability.

  • 2020-11-22 11:54

    Building on the other solutions: compute the cumulative distribution (as integers or floats, whichever you prefer), then use bisect to make the lookup fast.

    Here is a simple example (I used integers here):

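    import bisect
    import random

    l=[(20, 'foo'), (60, 'banana'), (10, 'monkey'), (10, 'monkey2')]
    def get_cdf(l):
        # turn (weight, item) pairs into (cumulative weight, item) pairs
        ret=[]
        c=0
        for weight, item in l:
            c+=weight
            ret.append((c, item))
        return ret

    def get_random_item(cdf):
        # draw r in 1..total (starting at 0 would give the first item one extra slot),
        # then find the first entry whose cumulative weight is >= r
        r=random.randint(1, cdf[-1][0])
        return cdf[bisect.bisect_left(cdf, (r,))][1]

    cdf=get_cdf(l)
    for i in range(100): print(get_random_item(cdf), end=' ')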
    

    The get_cdf function converts the weights 20, 60, 10, 10 into the running totals 20, 20+60, 20+60+10, 20+60+10+10.

    We then pick a random integer up to 20+60+10+10 using random.randint and use bisect to find the corresponding item quickly.
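
    The same idea carries over to the float probabilities in the question. The following is a minimal Python 3 sketch, not the answer's own code: the names values, probs, cum_probs and sample are made up for illustration, and it relies only on the standard library (itertools.accumulate, bisect.bisect_right, random.random).

```python
import bisect
import itertools
import random

# Values and probabilities taken from the question.
values = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# Running totals of the probabilities, ending at (approximately) 1.0.
cum_probs = list(itertools.accumulate(probs))


def sample(rng=random):
    """Return one value drawn according to probs (hypothetical helper)."""
    # Scale by the final total so that rounding in the running sum
    # cannot leave a gap at the top of the range.
    r = rng.random() * cum_probs[-1]
    # r in [cum_probs[i-1], cum_probs[i]) maps to values[i].
    idx = bisect.bisect_right(cum_probs, r)
    # Guard against r landing exactly on the last total through rounding.
    return values[min(idx, len(values) - 1)]
```

    Each draw costs one binary search, so the lookup is O(log n) in the number of distinct values after a one-time O(n) setup.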

  • 2020-11-22 11:55

    You might want to have a look at NumPy's random sampling distributions.
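
    The answer does not say which NumPy routine it has in mind. For a discrete distribution like the one in the question, one option is the choice method of a NumPy random Generator, which accepts a probability vector p. A short sketch under that assumption (the seed and the variable names are arbitrary):

```python
import numpy as np

# Values and probabilities taken from the question.
values = [1, 2, 3, 4, 5, 6]
probs = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]  # must sum to 1

rng = np.random.default_rng(seed=0)  # seeded only to make runs repeatable
samples = rng.choice(values, size=10_000, p=probs)

# Empirical frequency of each value, for comparison with probs.
observed = {v: float(np.mean(samples == v)) for v in values}
```

    The result is a NumPy array, which is convenient when the samples feed straight into further vectorized computation.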

  • 2020-11-22 11:55

    Here is a more effective way of doing this:

    Just call the following function with your 'weights' array (the indices are taken to be the corresponding items) and the number of samples needed. The function can easily be modified to handle ordered pairs.

    Returns indexes (or items) sampled/picked (with replacement) using their respective probabilities:

    import random

    def resample(weights, n):
        # n is the number of samples; the items are the indices of weights
        num_items = len(weights)
        beta = 0

        # Caveat: Assign max weight to max*2 for best results
        max_w = max(weights)*2

        # Pick an item uniformly at random, to start with
        current_item = random.randint(0, num_items-1)
        result = []

        for i in range(n):
            beta += random.uniform(0,max_w)

            # Step around the items until beta fits within the current item's weight
            while weights[current_item] < beta:
                beta -= weights[current_item]
                current_item = (current_item + 1) % num_items   # cyclic
            result.append(current_item)
        return result
    

    A short note on the concept used in the while loop: beta is a running total built from uniform random increments. We subtract the current item's weight from beta and advance the current index (cyclically) until we reach an item whose weight is at least beta; that item is the one sampled.

  • Since Python 3.6, there's a solution for this in Python's standard library, namely random.choices.

    Example usage: let's set up a population and weights matching those in the OP's question:

    >>> from random import choices
    >>> population = [1, 2, 3, 4, 5, 6]
    >>> weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]
    

    Now choices(population, weights) generates a single sample, returned in a list of length 1:

    >>> choices(population, weights)
    [4]
    

    The optional keyword-only argument k allows one to request more than one sample at once. This is valuable because there's some preparatory work that random.choices has to do every time it's called, prior to generating any samples; by generating many samples at once, we only have to do that preparatory work once. Here we generate a million samples, and use collections.Counter to check that the distribution we get roughly matches the weights we gave.

    >>> million_samples = choices(population, weights, k=10**6)
    >>> from collections import Counter
    >>> Counter(million_samples)
    Counter({5: 399616, 6: 200387, 4: 200117, 1: 99636, 3: 50219, 2: 50025})
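
    When samples have to be drawn one at a time, part of that preparatory work can be done up front: random.choices also accepts a keyword-only cum_weights argument, and the standard library documentation notes that the relative weights are converted to cumulative weights internally. A small sketch of that pattern follows; the wrapper name draw is hypothetical.

```python
import itertools
import random

# Population and weights taken from the question.
population = [1, 2, 3, 4, 5, 6]
weights = [0.1, 0.05, 0.05, 0.2, 0.4, 0.2]

# Compute the running totals once, up front.
cum_weights = list(itertools.accumulate(weights))


def draw(k=1):
    """Return a list of k samples, reusing the precomputed cumulative weights."""
    return random.choices(population, cum_weights=cum_weights, k=k)
```

    Passing both weights and cum_weights in the same call raises a TypeError, so only one of the two should be supplied.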
    