Probability distribution in Python

后端 未结 12 2014
误落风尘
误落风尘 2020-12-02 08:12

I have a bunch of keys that each have an unlikeliness variable. I want to randomly choose one of these keys, yet I want it to be more unlikely for unlikely (key, values) to

相关标签:
12条回答
  • 2020-12-02 08:30

    I suggest you port this PHP implementation of weighted random to Python. In particular, the binary-search-based second algorithm helps address your speed concerns.

    0 讨论(0)
  • 2020-12-02 08:30

    You want to give each object a weight. The bigger the weight the more likely it will happen. More precisely probx =weight/sum_all_weights.

    Then generate a random number in the range 0 to sum_all_weights and map it to each object.

    This code allows you to generate a random index and it is mapped when the object is created for speed. If all of your sets of objects have the same distribution then you can get by with only one RandomIndex object.

    import random
    
    class RandomIndex:
        def __init__(self, wlist):
            self._wi=[]
            self._rsize=sum(wlist)-1
            self._m={}
            i=0
            s=wlist[i]
            for n in range(self._rsize+1):
                if n == s:
                    i+=1
                    s+=wlist[i]
                self._m[n]=i    
    
        def i(self):
            rn=random.randint(0,self._rsize)
            return self._m[rn]
    
    
    sx=[1,2,3,4]
    
    
    wx=[1,10,100,1000] #weight list
    ri=RandomIndex(wx)
    
    cnt=[0,0,0,0]
    
    for i in range(1000):
        cnt[ri.i()] +=1  #keep track of number of times each index was generated
    
    print(cnt)  
    
    0 讨论(0)
  • 2020-12-02 08:31

    This activestate recipe gives an easy-to-follow approach, specifically the version in the comments that doesn't require you to pre-normalize your weights:

    import random
    
    def weighted_choice(items):
        """items is a list of tuples in the form (item, weight)"""
        weight_total = sum((item[1] for item in items))
        n = random.uniform(0, weight_total)
        for item, weight in items:
            if n < weight:
                return item
            n = n - weight
        return item
    

    This will be slow if you have a large list of items. A binary search would probably be better in that case... but would also be more complicated to write, for little gain if you have a small sample size. Here's an example of the binary search approach in python if you want to follow that route.

    (I'd recommend doing some quick performance testing of both methods on your dataset. The performance of different approaches to this sort of algorithm is often a bit unintuitive.)


    Edit: I took my own advice, since I was curious, and did a few tests.

    I compared four approaches:

    The weighted_choice function above.

    A binary-search choice function like so:

    def weighted_choice_bisect(items):
        added_weights = []
        last_sum = 0
    
        for item, weight in items:
            last_sum += weight
            added_weights.append(last_sum)
    
        return items[bisect.bisect(added_weights, random.random() * last_sum)][0]
    

    A compiling version of 1:

    def weighted_choice_compile(items):
        """returns a function that fetches a random item from items
    
        items is a list of tuples in the form (item, weight)"""
        weight_total = sum((item[1] for item in items))
        def choice(uniform = random.uniform):
            n = uniform(0, weight_total)
            for item, weight in items:
                if n < weight:
                    return item
                n = n - weight
            return item
        return choice
    

    A compiling version of 2:

    def weighted_choice_bisect_compile(items):
        """Returns a function that makes a weighted random choice from items."""
        added_weights = []
        last_sum = 0
    
        for item, weight in items:
            last_sum += weight
            added_weights.append(last_sum)
    
        def choice(rnd=random.random, bis=bisect.bisect):
            return items[bis(added_weights, rnd() * last_sum)][0]
        return choice
    

    I then built a big list of choices like so:

    choices = [(random.choice("abcdefg"), random.uniform(0,50)) for i in xrange(2500)]
    

    And an excessively simple profiling function:

    def profiler(f, n, *args, **kwargs):
        start = time.time()
        for i in xrange(n):
            f(*args, **kwargs)
        return time.time() - start
    

    The results:

    (Seconds taken for 1,000 calls to the function.)

    • Simple uncompiled: 0.918624162674
    • Binary uncompiled: 1.01497793198
    • Simple compiled: 0.287325024605
    • Binary compiled: 0.00327413797379

    The "compiled" results include the average time taken to compile the choice function once. (I timed 1,000 compiles, then divided that time by 1,000, and added the result to the choice function time.)

    So: if you have a list of items+weights which change very rarely, the binary compiled method is by far the fastest.

    0 讨论(0)
  • 2020-12-02 08:36

    In comments on the original post, Nicholas Leonard suggests that both the exchanging and the sampling need to be fast. Here's an idea for that case; I haven't tried it.

    If only sampling had to be fast, we could use an array of the values together with the running sum of their probabilities, and do a binary search on the running sum (with key being a uniform random number) -- an O(log(n)) operation. But an exchange would require updating all of the running-sum values appearing after the entries exchanged -- an O(n) operation. (Could you choose to exchange only items near the end of their lists? I'll assume not.)

    So let's aim for O(log(n)) in both operations. Instead of an array, keep a binary tree for each set to sample from. A leaf holds the sample value and its (unnormalized) probability. A branch node holds the total probability of its children.

    To sample, generate a uniform random number x between 0 and the total probability of the root, and descend the tree. At each branch, choose the left child if the left child has total probability <= x. Else subtract the left child's probability from x and go right. Return the leaf value you reach.

    To exchange, remove the leaf from its tree and adjust the branches that lead down to it (decreasing their total probability, and cutting out any single-child branch nodes). Insert the leaf into the destination tree: you have a choice of where to put it, so keep it balanced. Picking a random child at each level is probably good enough -- that's where I'd start. Increase each parent node's probability, back up to the root.

    Now both sampling and exchange are O(log(n)) on average. (If you need guaranteed balance, a simple way is to add another field to the branch nodes holding the count of leaves in the whole subtree. When adding a leaf, at each level pick the child with fewer leaves. This leaves the possibility of a tree getting unbalanced solely by deletions; this can't be a problem if there's reasonably even traffic between the sets, but if it is, then choose rotations during deletion using the leaf-count information on each node in your traversal.)

    Update: On request, here's a basic implementation. Haven't tuned it at all. Usage:

    >>> t1 = build_tree([('one', 20), ('two', 2), ('three', 50)])
    >>> t1
    Branch(Leaf(20, 'one'), Branch(Leaf(2, 'two'), Leaf(50, 'three')))
    >>> t1.sample()
    Leaf(50, 'three')
    >>> t1.sample()
    Leaf(20, 'one')
    >>> t2 = build_tree([('four', 10), ('five', 30)])
    >>> t1a, t2a = transfer(t1, t2)
    >>> t1a
    Branch(Leaf(20, 'one'), Leaf(2, 'two'))
    >>> t2a
    Branch(Leaf(10, 'four'), Branch(Leaf(30, 'five'), Leaf(50, 'three')))
    

    Code:

    import random
    
    def build_tree(pairs):
        tree = Empty()
        for value, weight in pairs:
            tree = tree.add(Leaf(weight, value))
        return tree
    
    def transfer(from_tree, to_tree):
        """Given a nonempty tree and a target, move a leaf from the former to
        the latter. Return the two updated trees."""
        leaf, from_tree1 = from_tree.extract()
        return from_tree1, to_tree.add(leaf)
    
    class Tree:
        def add(self, leaf):
            "Return a new tree holding my leaves plus the given leaf."
            abstract
        def sample(self):
            "Pick one of my leaves at random in proportion to its weight."
            return self.sampling(random.uniform(0, self.weight))
        def extract(self):
            """Pick one of my leaves and return it along with a new tree
            holding my leaves minus that one leaf."""
            return self.extracting(random.uniform(0, self.weight))        
    
    class Empty(Tree):
        weight = 0
        def __repr__(self):
            return 'Empty()'
        def add(self, leaf):
            return leaf
        def sampling(self, weight):
            raise Exception("You can't sample an empty tree")
        def extracting(self, weight):
            raise Exception("You can't extract from an empty tree")
    
    class Leaf(Tree):
        def __init__(self, weight, value):
            self.weight = weight
            self.value = value
        def __repr__(self):
            return 'Leaf(%r, %r)' % (self.weight, self.value)
        def add(self, leaf):
            return Branch(self, leaf)
        def sampling(self, weight):
            return self
        def extracting(self, weight):
            return self, Empty()
    
    def combine(left, right):
        if isinstance(left, Empty): return right
        if isinstance(right, Empty): return left
        return Branch(left, right)
    
    class Branch(Tree):
        def __init__(self, left, right):
            self.weight = left.weight + right.weight
            self.left = left
            self.right = right
        def __repr__(self):
            return 'Branch(%r, %r)' % (self.left, self.right)
        def add(self, leaf):
            # Adding to a random branch as a clumsy way to keep an
            # approximately balanced tree.
            if random.random() < 0.5:
                return combine(self.left.add(leaf), self.right)
            return combine(self.left, self.right.add(leaf))
        def sampling(self, weight):
            if weight < self.left.weight:
                return self.left.sampling(weight)
            return self.right.sampling(weight - self.left.weight)
        def extracting(self, weight):
            if weight < self.left.weight:
                leaf, left1 = self.left.extracting(weight)
                return leaf, combine(left1, self.right)
            leaf, right1 = self.right.extracting(weight - self.left.weight)
            return leaf, combine(self.left, right1)
    

    Update 2: In answering another problem, Jason Orendorff points out that the binary trees can be kept perfectly balanced by representing them in an array just like the classical heap structure. (This saves the space spent on pointers, too.) See my comments to that answer for how to adapt his code to this problem.

    0 讨论(0)
  • 2020-12-02 08:42

    Here is a classic way to do it, in pseudocode, where random.random() gives you a random float from 0 to 1.

    let z = sum of all the convictions
    let choice = random.random() * z 
    iterate through your objects:
        choice = choice - the current object's conviction
        if choice <= 0, return this object
    return the last object
    

    For an example: imagine you have two objects, one with weight 2, another with weight 4. You generate a number from 0 to 6. If choice is between 0 and 2, which will happen with 2/6 = 1/3 probability, then it will get subtracted by 2 and the first object is chosen. If choice is between 2 and 6, which will happen with 4/6 = 2/3 probability, then the first subtraction will still have choice being > 0, and the second subtraction will make the 2nd object get chosen.

    0 讨论(0)
  • 2020-12-02 08:43

    A very easy and simple way of doing this is to set weights for each of the values, and it wouldn't require much memory.

    You could probably use a hash/dictionary to do this.

    What you'll want to do is to have the random number, x, multiplied and summed over the entire set of things you want selected, and divide that result over the number of objects in your set.

    Pseudo-code:

    objectSet = [(object1, weight1), ..., (objectN, weightN)]
    sum = 0
    rand = random()
    for obj, weight in objectSet
        sum = sum+weight*rand
    choice = objectSet[floor(sum/objectSet.size())]
    

    EDIT: I just thought of how slow my code would be with very large sets (it's O(n)). The following pseudo-code is O(log(n)), and is basically using a binary search.

    objectSet = [(object1, weight1), ..., (objectN, weightN)]
    sort objectSet from less to greater according to weights
    choice = random() * N # where N is the number of objects in objectSet
    do a binary search until you have just one answer
    

    There are implementations of binary search in Python all over the 'net, so no need repeating here.

    0 讨论(0)
提交回复
热议问题