Python random sample with a generator / iterable / iterator


Question:

Do you know if there is a way to get Python's random.sample to work with a generator object? I am trying to take a random sample from a very large text corpus. The problem is that random.sample() raises the following error:

TypeError: object of type 'generator' has no len() 

I was thinking there might be a way of doing this with something from itertools, but I couldn't find anything with a bit of searching.

A somewhat made up example:

import random

def list_item(ls):
    for item in ls:
        yield item

random.sample(list_item(range(100)), 20)


UPDATE


As per Martijn Pieters's request, I did some timing of the three currently proposed methods. The results are as follows.

Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s

Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s

Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s

Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s

Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s

Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s

So it turns out that list.insert has a serious drawback when it comes to large sample sizes. The code I used to time the methods:

from heapq import nlargest
import random
import timeit

def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results
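A minimal sketch of how such a comparison could be driven with timeit; the helper below and its name are illustrative assumptions (Python 2 style, to match the rest of the code here), not the original benchmark script:

import timeit

def time_method(sampler, population_size, sample_size):
    # Assumed harness: time a single call of `sampler` on a fresh generator.
    gen = (x for x in xrange(population_size))
    start = timeit.default_timer()
    sampler(gen, sample_size)
    elapsed = timeit.default_timer() - start
    print("Using %s %.4f s" % (sampler.__name__, elapsed))

# e.g. the "Sampling 1000 from 10000" block above:
# for f in (iterSample, sample_from_iterable, iter_sample_fast):
#     time_method(f, 10000, 1000)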

I also ran a test to check that all the methods indeed take an unbiased sample of the generator. For each method I sampled 1000 elements from 10000, repeated 100000 times, and computed the average frequency of occurrence of each item in the population; it turns out to be ~0.1 for all three methods, as one would expect.
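A sketch of what such a frequency check could look like; the helper below is my own illustration, not the original test code:

from collections import Counter

def occurrence_frequencies(sampler, population_size=10000, sample_size=1000, runs=100000):
    # Count how often each population element ends up in a sample across many runs.
    counts = Counter()
    for _ in xrange(runs):
        counts.update(sampler((x for x in xrange(population_size)), sample_size))
    # Per-item frequency of being sampled; for an unbiased sampler each value
    # should hover around sample_size / population_size = 0.1.
    return {item: counts[item] / float(runs) for item in xrange(population_size)}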

Answer 1:

While Martijn Pieters's answer is correct, it slows down when samplesize becomes large, because using list.insert in a loop has quadratic complexity.

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in xrange(samplesize):
            results.append(iterator.next())
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
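A hypothetical call, mirroring the question's made-up example:

# Sample 20 items lazily from a generator without materializing the whole thing.
sample = iter_sample_fast((x * x for x in xrange(1000000)), 20)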

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):

  • iterSample: 5.05s
  • iter_sample_fast: 2.64s


Answer 2:

You can't.

You have two options: read the whole generator into a list and sample from that list, or use a method that reads the generator one item at a time and picks the sample as it goes:

import random

def iterSample(iterable, samplesize):
    results = []

    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v)  # add first samplesize items in random order
            else:
                results[r] = v  # at a decreasing rate, replace random items

    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")

    return results

This method adjusts the chance that the next item is part of the sample based on the number of items in the iterable so far. It doesn't need to hold more than samplesize items in memory.

The solution isn't mine; it was provided as part of another answer here on SO.
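For comparison, the first option mentioned above (read the whole generator into a list, then sample) needs only a couple of lines; a minimal sketch, assuming the whole population fits in memory:

import random

def sample_by_materializing(iterable, samplesize):
    # Simple but memory-hungry: pull every item into a list first.
    return random.sample(list(iterable), samplesize)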



Answer 3:

Just for the heck of it, here's a one-liner that samples k elements without replacement from the n items generated in O(n lg k) time:

from heapq import nlargest
import random

def sample_from_iterable(it, k):
    return (x for _, x in nlargest(k, ((random.random(), x) for x in it)))
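Note that the one-liner returns a generator over the k chosen items; a hypothetical usage:

# Materialize the sample; items come back ordered by their random keys, i.e. in random order.
picked = list(sample_from_iterable(xrange(100), 5))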


Answer 4:

If the number of items in the iterator is known (by counting the items elsewhere), another approach is:

def iter_sample(iterable, iterlen, samplesize):
    if iterlen < samplesize:
        raise ValueError("Sample larger than population.")
    # ... pick samplesize random indexes up front, then collect the matching items (see the sketch below)
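A sketch of the index-based idea this answer describes: pre-draw samplesize distinct random positions, then collect the matching items in a single pass. Everything beyond the length guard, including the function name, is my reconstruction rather than the answer's exact code:

import random

def iter_sample_by_indexes(iterable, iterlen, samplesize):
    if iterlen < samplesize:
        raise ValueError("Sample larger than population.")
    # Rejection-sample distinct positions; this loop is what gets slow
    # when samplesize approaches iterlen.
    indexes = set()
    while len(indexes) < samplesize:
        indexes.add(random.randint(0, iterlen - 1))
    results = []
    for i, item in enumerate(iterable):
        if i in indexes:
            results.append(item)
            if len(results) == samplesize:
                break
    random.shuffle(results)  # positions were collected in iteration order
    return results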

I find this quicker, especially when samplesize is small relative to iterlen. When the whole sample, or nearly the whole sample, is asked for, however, there are issues.

iter_sample (iterlen=10000, samplesize=100): 1 ms
iter_sample_fast (iterlen=10000, samplesize=100): 15 ms

iter_sample (iterlen=1000000, samplesize=100): 65 ms
iter_sample_fast (iterlen=1000000, samplesize=100): 1477 ms

iter_sample (iterlen=1000000, samplesize=1000): 64 ms
iter_sample_fast (iterlen=1000000, samplesize=1000): 1459 ms

iter_sample (iterlen=1000000, samplesize=10000): 86 ms
iter_sample_fast (iterlen=1000000, samplesize=10000): 1480 ms

iter_sample (iterlen=1000000, samplesize=100000): 388 ms
iter_sample_fast (iterlen=1000000, samplesize=100000): 1521 ms

iter_sample (iterlen=1000000, samplesize=1000000): 25359 ms
iter_sample_fast (iterlen=1000000, samplesize=1000000): 2178 ms



Answer 5:

The fastest method (until proven otherwise) when you have an idea of how long the generator is; the sample will be asymptotically uniformly distributed:

import numpy

def gen_sample(generator_list, sample_size, iterlen):
    num = 0
    inds = numpy.random.random(iterlen)
    # ... use the pre-drawn random numbers to decide which items to keep (see the sketch below)
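One way the idea could continue, keeping each position with probability sample_size/iterlen based on the pre-drawn numpy array; the function below (including its name) is my reconstruction, not the answer's exact code:

import numpy

def gen_sample_sketch(generator, sample_size, iterlen):
    # Keep positions whose uniform draw falls below sample_size/iterlen,
    # so the expected number of kept items is sample_size.
    keep = numpy.random.random(iterlen) < (float(sample_size) / iterlen)
    results = []
    for i, item in enumerate(generator):
        if i >= iterlen:
            break
        if keep[i]:
            results.append(item)
    return results  # length is only approximately sample_size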

It is the fastest on both the small iterable and the huge iterable (and probably everything in between).

# Huge
res = gen_sample(xrange(5000000), 200000, 5000000)
timing: 1.22 s

# Small
z = gen_sample(xrange(10000), 1000, 10000)
timing: 0.000441

