Python random sample generator (comfortable with huge population sizes)


既然无缘 2021-01-14 06:57

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end

5 answers

迷失自我 (original poster)

2021-01-14 07:33

Here is another idea. For a huge population, we need to keep some information about which records have already been selected. In your case you keep one integer index per selected record (a 32-bit or 64-bit integer), plus some code to search reasonably quickly for selected/not selected. With a large number of selected records, this record keeping might become prohibitive. What I would propose is to use a Bloom filter for the set of selected indices. False positive matches are possible, but false negatives are not, so there is no risk of getting duplicated records. It does introduce a slight bias: records that hit a false positive will be excluded from sampling. But memory efficiency is good: fewer than 10 bits per element are required for a 1% false positive probability. So if you select 5% of the population with a 1% false positive rate, you miss about 0.05 × 0.01 = 0.0005 of your population, which might be OK depending on your requirements. If you want a lower false positive rate, use more bits. Either way, memory use should be a lot lower than storing the indices themselves, though I expect more code to execute per sampled record.

Sorry, no code
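
For readers who want something concrete, the idea could be sketched roughly as below. This is only an illustration and not code from the answer: the names `BloomFilter` and `bloom_sample`, the `expected_draws` parameter, and the choice of BLAKE2 with double hashing are all assumptions made for the example. The filter is sized with the standard formulas for a target false positive rate, which is where the "fewer than 10 bits per element at 1%" figure comes from.

```python
import hashlib
import math
import random


class BloomFilter:
    """Tiny Bloom filter over non-negative integers below 2**64 (illustrative only)."""

    def __init__(self, capacity, fp_rate=0.01):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.blake2b(item.to_bytes(8, "little"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


def bloom_sample(population_size, expected_draws, fp_rate=0.01, rng=random):
    """Yield distinct random indices in [0, population_size).

    The generator never stops on its own; the caller decides when it has enough.
    """
    seen = BloomFilter(expected_draws, fp_rate)
    while True:
        idx = rng.randrange(population_size)
        if idx in seen:
            continue  # a real repeat or a false positive; skipped either way
        seen.add(idx)
        yield idx
```

A caller would pull items until its own stopping condition is met, for example `gen = bloom_sample(10**9, expected_draws=50_000_000)` followed by repeated `next(gen)`. Drawing far more than `expected_draws` pushes the false positive rate above the target, and the loop will spin indefinitely once the filter saturates or the population is exhausted, so the estimate should be generous.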
