How to do weighted random sample of categories in python

后端未结

关注

 9  2153

Given a list of tuples where each tuple consists of a probability and an item I\'d like to sample an item according to its probability. For example, give the list [ (.3, \'a\'),

相关标签:

9条回答

南笙

2021-01-31 18:11

I'm not sure if this is the pythonic way of doing what you ask, but you could use random.sample(['a','a','a','b','b','b','b','c','c','c'],k) where k is the number of samples you want.

For a more robust method, bisect the unit interval into sections based on the cumulative probability and draw from the uniform distribution (0,1) using random.random(). In this case the subintervals would be (0,.3)(.3,.7)(.7,1). You choose the element based on which subinterval it falls into.

0 讨论(0)
发布评论:

提交评论
- 加载中...

太阳男子

2021-01-31 18:15

import numpy

n = 1000
pairs = [(.3, 'a'), (.3, 'b'), (.4, 'c')]
probabilities = numpy.random.multinomial(n, zip(*pairs)[0])
result = zip(probabilities, zip(*pairs)[1])
# [(299, 'a'), (299, 'b'), (402, 'c')]
[x[0] * x[1] for x in result]
# ['aaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbb', 'cccccccccccccccccccc']

How exactly would you like to receive the results?

0 讨论(0)

南方客

2021-01-31 18:15
This might do what you want:
```
numpy.array([.3,.4,.3]).cumsum().searchsorted(numpy.random.sample(5))
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
暖寄归人

2021-01-31 18:18
Just inspired of sholte's very straightforward (and correct) answer: I'll just demonstrate how easy it will be to extend it to handle arbitrary items, like:
```
In []: s= array([.3, .4, .3]).cumsum().searchsorted(sample(54))
In []: c, _= histogram(s, bins= arange(4))
In []: [item* c[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccccccc']
```
Update:
Based on the feedback of phant0m, it turns out that an even more straightforward solution can be implemented based on multinomial, like:
```
In []: s= multinomial(54, [.3, .4, .3])
In []: [item* s[i] for i, item in enumerate('abc')]
Out[]: ['aaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbb', 'cccccccccccc']
```
IMHO here we have a nice summary of empirical cdf and multinomial based sampling yielding similar results. So, in a summary, pick it up one which suits best for your purposes.
0 讨论(0)
发布评论:

提交评论
- 加载中...
再見小時候

2021-01-31 18:20

Howabout creating 3 "a", 4 "b" and 3 "c" in a list an then just randomly select one. With enough iterations you will get the desired probability.

0 讨论(0)
发布评论:

提交评论
- 加载中...

温柔的废话

2021-01-31 18:21

This may be of marginal benefit but I did it this way:

import scipy.stats as sps
N=1000
M3 = sps.multinomial.rvs(1, p = [0.3,0.4,0.3], size=N, random_state=None)
M3a = [ np.where(r==1)[0][0] for r in M3 ] # convert 1-hot encoding to integers

This is similar to @eat's answer.

0 讨论(0)

1 2 下一页