Iterative or Lazy Reservoir Sampling

Submitted by 删除回忆录丶 on 2019-12-04 04:12:35

If you know in advance the total number of items that an iterable population will yield, you can yield the items of a sample as you come to them, rather than only after reaching the end. If you don't know the population size ahead of time, this is impossible, because the probability of any given item being in the sample can't be calculated.

Here's a quick generator that does this:

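import random

def sample_given_size(population, population_size, sample_size):
    # population_size counts the items not yet seen; sample_size counts the items still needed.
    for item in population:
        # Select with probability (still needed) / (still unseen); needs true division (Python 3).
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1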

Note that the generator yields items in the order they appear in the population (not in random order, as random.sample and most reservoir sampling implementations do), so a slice of the sample will not be a random subsample!
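To make the ordering behaviour concrete, here is a small usage sketch. The generator is repeated so the snippet runs on its own; the stream of 100 integers and the sample size of 3 are just illustrative values, not part of the original answer.

```python
import random

def sample_given_size(population, population_size, sample_size):
    for item in population:
        # Select with probability (still needed) / (still unseen).
        if random.random() < sample_size / population_size:
            yield item
            sample_size -= 1
        population_size -= 1

# A one-shot iterator stands in for a stream that can only be read once.
stream = iter(range(100))
sample = list(sample_given_size(stream, 100, 3))

# The sample always has exactly 3 items, and they come out in stream order.
print(sample)
```

Because the selection probability reaches 1 once the items still needed equal the items still unseen, the generator always yields exactly sample_size items, provided population_size matches the true length of the stream.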

If the population size is known beforehand, can't you just generate sample_size random "indices" (positions in the stream) and use them to do a lazy yield? You won't have to read the entire stream.

For instance, if population_size were 100 and sample_size were 3, you would generate three distinct random integers from 1 to 100; say you get 10, 67 and 72.

Now you yield the 10th, 67th and 72nd elements of the stream and ignore the rest.
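A minimal sketch of this index-based idea, under some assumptions: the function name sample_by_indices is hypothetical, positions are 0-based rather than 1-based, and random.sample over a range is used to pick the distinct positions.

```python
import random

def sample_by_indices(population, population_size, sample_size):
    # Choose the positions up front (0-based), in increasing order.
    wanted = sorted(random.sample(range(population_size), sample_size))
    if not wanted:
        return
    targets = iter(wanted)
    next_target = next(targets)
    for index, item in enumerate(population):
        if index == next_target:
            yield item
            next_target = next(targets, None)
            if next_target is None:
                # Nothing past the last chosen position needs to be read.
                return

print(list(sample_by_indices(iter(range(100)), 100, 3)))
```

Like the generator above, this yields items in stream order. It stops reading once the last chosen position has been yielded, at the cost of holding sample_size indices in memory.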

I guess I don't understand the question.
