Random sample from a very long iterable, in Python

南笙 2020-12-11 07:58

I have a long Python generator that I want to "thin out" by randomly selecting a subset of values. Unfortunately, random.sample() will not work with arbitrary

5 answers
  • 2020-12-11 08:14

    If you needed a subset of the original iterator at a fixed frequency (that is, always about 1%: roughly 100 values if the generator yields 10000 numbers, roughly 10000 if it yields 1000000), you could wrap the iterator in a construct that yields each of the inner loop's results with a probability of 1%.

    So I guess you want instead a fixed number of samples from a source of unknown cardinality, as in the Perl algorithm you mention.

    You can wrap the iterator in a construct that keeps a small memory of its own (the reservoir) and replaces entries in it with decreasing probability as more items are seen.

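    import itertools
    import random
    
    def reservoir(iterator, size):
        iterator = iter(iterator)
        # Fill the reservoir with the first `size` items
        # (an iterator cannot be sliced directly).
        R = list(itertools.islice(iterator, size))
        n = size
        for e in iterator:
            n = n + 1                 # items seen so far, including e
            j = random.randrange(n)   # uniform in [0, n)
            if j < size:              # true with probability size/n
                R[j] = e
        return R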
    

    So

    print(reservoir(range(1, 1000), 3))
    

    might print out

    [656, 774, 828]
    

    I tried generating one million rounds as above and comparing the distributions of the three columns with this filter, which counts how often each value occurs and then how often each count occurs (I expected a Gaussian distribution):

    #                get first column and clean it
    python file.py | cut -f 1 -d " " | tr -cd "0-9\n" \
        | sort | uniq -c | cut -b1-8 | tr -cd "0-9\n" | sort | uniq -c
    

    and while not (yet) truly Gaussian, it looks good enough to me.
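
A similar check can be run without leaving Python. This is only a minimal sketch: `inclusion_counts` is a hypothetical helper name, the population size and round count are arbitrary, and the sampler is redefined so that the block stands alone. It counts how often each value ends up anywhere in the reservoir rather than in one column, because Algorithm R does not randomize the order inside the reservoir.

```python
import collections
import itertools
import random

def reservoir(iterable, size):
    """Algorithm R: keep `size` items, each input item equally likely."""
    it = iter(iterable)
    sample = list(itertools.islice(it, size))  # fill the reservoir
    for n, item in enumerate(it, start=size + 1):
        j = random.randrange(n)  # uniform in [0, n)
        if j < size:             # true with probability size/n
            sample[j] = item
    return sample

def inclusion_counts(population, size, rounds):
    """Count how often each value is selected over many rounds."""
    counts = collections.Counter()
    for _ in range(rounds):
        counts.update(reservoir(range(population), size))
    return counts

random.seed(12345)  # fixed seed so that reruns are comparable
rounds, population, size = 20000, 20, 3
counts = inclusion_counts(population, size, rounds)
expected = rounds * size / population  # same for every value if unbiased
for value in sorted(counts):
    # Ratios close to 1.0 suggest that no value is favored.
    print(value, counts[value], round(counts[value] / expected, 2))
```

If every ratio stays close to 1.0, the sampler is at least not obviously biased; a ratio that sits well away from 1.0 for the first values after the initial fill is the typical sign of an off-by-one in the random index.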

  • 2020-12-11 08:21

    One possible method is to build a generator around the iterator to select random elements:

    import random
    
    def random_wrap(iterator, threshold):
        # Yield each item independently with probability `threshold`.
        for item in iterator:
            if random.random() < threshold:
                yield item
    

    This method is useful when you don't know the length of the iterator and it may be too large to hold in memory. Note that the size of the final list is not guaranteed: each item is kept independently, so the count varies from run to run.

    Some sample runs:

    >>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
    ['f', 'h', 'i', 'r', 'w', 'x']
    
    >>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
    ['j', 'r', 's', 'u', 'x']
    
    >>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
    ['c', 'e', 'h', 'n', 'o', 'r', 'z']
    
    >>> list(random_wrap(iter('abcdefghijklmnopqrstuvwxyz'), 0.25))
    ['b', 'c', 'e', 'h', 'j', 'p', 'r', 's', 'u', 'v', 'x']
    
  • 2020-12-11 08:21

    Use the itertools.compress() function with a random selector iterable (here keeping about 10% of the items):

    itertools.compress(long_sequence, (random.randrange(100) < 10 for _ in itertools.repeat(None)))
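
For a version that can be run as-is, here is a minimal sketch of the same idea. The helper name `thin` and its `probability` parameter are made up for illustration; only `itertools.compress()`, `itertools.repeat()` and `random.random()` come from the standard library.

```python
import itertools
import random

def thin(iterable, probability):
    """Lazily keep each item with the given probability (hypothetical helper)."""
    # An endless stream of True/False selectors; compress() stops on its
    # own as soon as the data iterable is exhausted.
    selectors = (random.random() < probability for _ in itertools.repeat(None))
    return itertools.compress(iterable, selectors)

random.seed(0)  # fixed seed so that reruns are comparable
kept = list(thin(range(1000), 0.1))
print(len(kept))  # around 100, but the exact size is not fixed
print(kept[:5])   # items come out in their original order
```

As with any per-item coin flip, this keeps the original order and works on generators of unknown length, but it cannot guarantee how many items come out.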
    
  • 2020-12-11 08:23

    Use Algorithm R (https://en.wikipedia.org/wiki/Reservoir_sampling), which runs in O(n), to select k random elements from an iterable:

    import itertools
    import random
    
    def reservoir_sample(iterable, k):
        it = iter(iterable)
        if not (k > 0):
            raise ValueError("sample size must be positive")
    
        sample = list(itertools.islice(it, k)) # fill the reservoir
        random.shuffle(sample) # if there are fewer than *k* items,
                               #   return all of them in random order.
        for i, item in enumerate(it, start=k+1):
            j = random.randrange(i) # random [0..i)
            if j < k:
                sample[j] = item # replace item with gradually decreasing probability
        return sample
    

    Example:

    >>> reservoir_sample(iter('abcdefghijklmnopqrstuvwxyz'), 5)
    ['w', 'i', 't', 'b', 'e']
    

    reservoir_sample() code is from this answer.

  • 2020-12-11 08:26

    Since you know the length of the data returned by your iterable, you can use range() (xrange() in Python 2) to quickly generate indices into your iterable. Then you can just run the iterable until you've grabbed all of the data:

    import random
    
    def sample(it, length, k):
        # Pick k distinct positions up front and remember each one's output slot.
        indices = random.sample(range(length), k)
        slot = {index: pos for pos, index in enumerate(indices)}
        result = [None] * k
        for index, datum in enumerate(it):
            if index in slot:  # dict lookup instead of scanning the list
                result[slot[index]] = datum
        return result
    
    print(sample(iter("abcd"), 4, 2))
    

    Alternatively, here is an implementation of reservoir sampling using "Algorithm R":

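    import random
    
    def R(it, k):
        '''https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R'''
        it = iter(it)
        result = []
        for i, datum in enumerate(it):
            if i < k:
                # Fill the reservoir with the first k items.
                result.append(datum)
            else:
                # randint() includes both ends, so j is uniform over the
                # i + 1 items seen so far and j < k with probability k/(i+1).
                j = random.randint(0, i)
                if j < k:
                    result[j] = datum
        return result
    
    print(R(iter("abcd"), 2))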
    

    Note that algorithm R doesn't provide a random order for the results. In the example given, 'b' will never precede 'a' in the results.
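
If the order of the results matters too, one option is to shuffle the reservoir once at the end. The following is a minimal sketch of that variation, not part of Algorithm R itself; the function name `reservoir_shuffled` is made up for illustration.

```python
import itertools
import random

def reservoir_shuffled(iterable, k):
    """Algorithm R followed by a final shuffle (hypothetical helper)."""
    it = iter(iterable)
    result = list(itertools.islice(it, k))  # the first k items fill the reservoir
    for i, datum in enumerate(it, start=k):
        j = random.randint(0, i)  # inclusive on both ends: i + 1 choices
        if j < k:
            result[j] = datum
    random.shuffle(result)  # remove the ordering bias left by the fill step
    return result

print(reservoir_shuffled(iter("abcd"), 2))
```

With the final shuffle, 'b' can precede 'a', and an input shorter than k comes back whole in random order.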
