Python random sample generator (comfortable with huge population sizes)

既然无缘 2021-01-14 06:57

As you might know, random.sample(population, sample_size) quickly returns a random sample, but what if you don't know the size of the sample in advance? You end

5 answers
  • 2021-01-14 07:28

    First, I would split the population into blocks. The function that does the block sampling can easily be a generator, so it can process a population of arbitrary size one block at a time.

    Imagine an infinite population, a population block of 512 and a sample size of 8. You could gather as many samples as you need, and for further reduction sample the already sampled space again (for 1024 blocks this means 8192 samples from which you can sample again).

    At the same time, this allows for parallel processing which may be feasible in case of very large samples.

    Example with an in-memory population:

    import random
    
    population = [random.randint(0, 1000) for i in range(0, 150000)]
    
    def sample_block(population, block_size, sample_size):
        block_number = 0
        while True:
            block = population[block_number * block_size:(block_number + 1) * block_size]
            try:
                block_sample = random.sample(block, sample_size)
            except ValueError:
                # The block holds fewer than sample_size items: population exhausted
                break
            yield block_sample
            block_number += 1
    
    sampler = sample_block(population, 512, 8)
    samples = []
    
    # Iterating the generator collects every block sample until it stops
    for block_sample in sampler:
        samples.extend(block_sample)
    
    # Second stage: sample again from the already sampled space
    print(random.sample(samples, 200))
    

    If the population were external to the script (a file or a block device), the only modification is that you would have to load the appropriate chunk into memory. A proof of concept of how sampling an infinite population could look:

    import random
    import time
    
    def population():
        # Infinite stream of random integers standing in for an external source
        while True:
            yield random.randint(0, 10000)
    
    def reduced_population(samples):
        for sample in samples:
            yield sample
    
    def sample_block(generator, block_size, sample_size):
        block_number = 0
        block = []
        # A for loop ends cleanly when a finite generator is exhausted
        for item in generator:
            block.append(item)
            if len(block) == block_size:
                s = random.sample(block, sample_size)
                block_number += 1
                block = []
                print('Sampled block {} with result {}.'.format(block_number, s))
                yield s
    
    samples = []
    result = []
    reducer = sample_block(population(), 512, 12)
    
    # The population is infinite, so this loop runs until it is interrupted
    while True:
        samples.append(next(reducer))
        if len(samples) == 1000:
            # Reduce the 1000 gathered block samples to 15 of them
            sampler = sample_block(reduced_population(samples), 1000, 15)
            result.append(list(sampler))
            samples = []  # start gathering the next batch
            time.sleep(5)
    

    Ideally, you'd also gather the samples and again sample them.
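
    For a population that lives outside the script, the chunk loading mentioned above can be sketched with itertools.islice, which pulls one block at a time from any iterable (an open file yields its lines, for example). This is only a minimal sketch: the name sample_blocks and the use of range as a stand-in data source are assumptions for illustration, not part of the code above.

```python
import itertools
import random

def sample_blocks(iterable, block_size, sample_size):
    """Yield a random sample from each successive block of an iterable."""
    iterator = iter(iterable)
    while True:
        # Only one block is held in memory at a time
        block = list(itertools.islice(iterator, block_size))
        if not block or len(block) < sample_size:
            return  # source exhausted, or the tail is too short to sample
        yield random.sample(block, sample_size)

# Stand-in for an external source; a file object could be passed instead
stream = range(10000)

# First stage: 8 items from every block of 512
first_pass = [x for s in sample_blocks(stream, 512, 8) for x in s]

# Second stage: sample again from the gathered samples
second_pass = random.sample(first_pass, 20)
```

    Because the generator only ever holds one block, the consumer can stop iterating at any point without the rest of the source being read.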

  • 2021-01-14 07:33

    Here is another idea. For a huge population we want to keep some information about which records have already been selected. In your case you keep one integer index per selected record (a 32-bit or 64-bit integer), plus some code to do a reasonably fast selected/not-selected lookup. With a large number of selected records this bookkeeping might become prohibitive.

    What I would propose is to use a Bloom filter for the set of selected indices. False positive matches are possible, but false negatives are not, so there is no risk of getting duplicate records. It does introduce a slight bias: records that produce a false positive are excluded from sampling. But memory efficiency is good: fewer than 10 bits per element are required for a 1% false positive probability. So if you select 5% of the population with a 1% false positive rate, you miss about 0.0005 of your population, which might be acceptable depending on your requirements. If you want fewer false positives, use more bits. Either way, memory use would be a lot lower than storing the indices, though I expect there is more code to execute per sampled record.

    Sorry, no code
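
    For illustration only, the idea could be sketched roughly as follows. Everything here is an assumption rather than part of the answer: the BloomFilter class is a tiny hand-rolled one built on hashlib, the name bloom_sample is made up, and the filter is sized from an upper bound max_draws on how many indexes will ever be requested, since a Bloom filter needs a capacity estimate up front. It is meant for samples much smaller than the population.

```python
import hashlib
import random

class BloomFilter(object):
    """Minimal Bloom filter over integers (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive all bit positions from one SHA-256 digest (double hashing)
        digest = hashlib.sha256(str(item).encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

def bloom_sample(population_size, max_draws, bits_per_item=10, num_hashes=7):
    """Yield up to max_draws distinct random indexes, one at a time."""
    if max_draws > population_size:
        raise ValueError("cannot draw more indexes than the population holds")
    seen = BloomFilter(max_draws * bits_per_item, num_hashes)
    produced = 0
    while produced < max_draws:
        idx = random.randrange(population_size)
        if idx in seen:
            continue  # already drawn, or a false positive (the slight bias)
        seen.add(idx)
        produced += 1
        yield idx
```

    The caller can stop iterating early. Indexes rejected by a false positive are simply never returned, which is the bias described above; if max_draws approaches the population size the rejection loop can become very slow.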

  • 2021-01-14 07:36

    That's what generators are for, I believe. Here is an example of Fisher-Yates-Knuth sampling via a generator and yield: you get the sampled elements one by one and stop whenever you want.

    Code updated

    import random
    import array
    # import numpy  # only needed for the commented-out alternative below
    
    class populationFYK(object):
        """
        Implementation of the Fisher-Yates-Knuth shuffle
        """
        def __init__(self, population):
            self._population = population      # reference to the population
            self._length     = len(population) # length of the sequence
            self._index      = len(population)-1 # last unsampled index
            self._popidx     = array.array('i', range(0,self._length))
    
            # array module vs numpy
            #self._popidx     = numpy.empty(self._length, dtype=numpy.int32)
            #for k in range(0,self._length):
            #    self._popidx[k] = k
    
    
        def swap(self, idx_a, idx_b):
            """
            Swap two elements in population
            """
            temp = self._popidx[idx_a]
            self._popidx[idx_a] = self._popidx[idx_b]
            self._popidx[idx_b] = temp
    
        def sample(self):
            """
            Yield one sampled case from population
            """
            while self._index >= 0:
                idx = random.randint(0, self._index) # index of the sampled event
    
                if idx != self._index:
                    self.swap(idx, self._index)
    
                sampled = self._population[self._popidx[self._index]] # yielding it
    
                self._index -= 1 # one less to be sampled
    
                yield sampled
    
        def index(self):
            return self._index
    
        def restart(self):
            self._index = self._length - 1
            for k in range(0,self._length):
                self._popidx[k] = k
    
    if __name__=="__main__":
        population = [1,3,6,8,9,3,2]
    
        gen = populationFYK(population)
    
        for k in gen.sample():
            print(k)
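
    The class above allocates an index array as large as the population. For very large populations, a variant of the same shuffle can store only the positions that have actually been swapped, so memory grows with the number of elements drawn rather than with the population size. The following is a sketch of that variant, not the code above; the name lazy_shuffle_indexes is hypothetical.

```python
import itertools
import random

def lazy_shuffle_indexes(n):
    """Yield the indexes 0..n-1 in random order, one at a time."""
    displaced = {}  # position -> index stored there, when it differs from the position
    for last in range(n - 1, -1, -1):
        pick = random.randint(0, last)
        # Index currently sitting at position `pick`
        chosen = displaced.get(pick, pick)
        # Move the index at position `last` into the slot just vacated
        displaced[pick] = displaced.get(last, last)
        # Position `last` is never visited again, so forget it
        displaced.pop(last, None)
        yield chosen

# Draw only a handful of indexes from a very large range
few = list(itertools.islice(lazy_shuffle_indexes(10**12), 10))
print(few)
```

    Running the generator to the end yields every index exactly once, just like a full shuffle.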
    
  • 2021-01-14 07:38

    I wrote (in Python 2.7.9) a random sampler generator (of indexes) whose speed depends only on the sample size (it should be O(ns log(ns)), where ns is the sample size). So it is fast when the sample size is small compared to the population size, because it does NOT depend on the population size at all. It doesn't build any population collection; it just picks random indexes and uses a kind of bisect method on the sampled indexes to avoid duplicates and keep them sorted. Given an iterable population, here's how to use the itersample generator:

    import random
    sampler=itersample(len(population))
    next_pick=next(sampler) # pick the next random (index of) element
    

    or

    import random
    sampler=itersample(len(population))
    sample=[]
    for index in sampler:
        # do something with (index of) picked element
        sample.append(index) # build a sample
        if some_condition: # stop sampling when needed
            break
    

    If you need the actual elements and not just the indexes, index the population when needed (population[next(sampler)] and population[index] for the first and second example respectively).

    The results of some tests show that speed does NOT depend on population size, so if you need to randomly pick only 10 elements from a population of 100 billion, you pay only for 10 (remember, we don't know in advance how many elements we'll pick, otherwise you'd be better off using random.sample).

    Sampling 1000 from 1000000
    Using itersample 0.0324 s
    
    Sampling 1000 from 10000000
    Using itersample 0.0304 s
    
    Sampling 1000 from 100000000
    Using itersample 0.0311 s
    
    Sampling 1000 from 1000000000
    Using itersample 0.0329 s
    

    Other tests confirm that running time is slightly more than linear with sample size:

    Sampling 100 from 1000000000
    Using itersample 0.0018 s
    
    Sampling 1000 from 1000000000
    Using itersample 0.0294 s
    
    Sampling 10000 from 1000000000
    Using itersample 0.4438 s
    
    Sampling 100000 from 1000000000
    Using itersample 8.8739 s
    

    Finally, here is the generator function itersample:

    import random
    def itersample(c): # c: population size
        sampled=[]
        def fsb(a,b): # free spaces before middle of interval a,b
            fsb.idx=a+(b+1-a)//2 # integer division, so idx stays a valid list index
            fsb.last=sampled[fsb.idx]-fsb.idx if len(sampled)>0 else 0
            return fsb.last
        while len(sampled)<c:
            sample_index=random.randrange(c-len(sampled))
            a,b=0,len(sampled)-1
            if fsb(a,a)>sample_index:
                yielding=sample_index
                sampled.insert(0,yielding)
                yield yielding
            elif fsb(b,b)<sample_index+1:
                yielding=len(sampled)+sample_index
                sampled.insert(len(sampled),yielding)
                yield yielding
            else: # sample_index falls inside sampled list
                while a+1<b:
                    if fsb(a,b)<sample_index+1:
                        a=fsb.idx
                    else:
                        b=fsb.idx
                yielding=a+1+sample_index
                sampled.insert(a+1,yielding)
                yield yielding
    
  • 2021-01-14 07:39

    You can get a sample of size K out of a population of size N by picking K non-repeating random numbers in the range [0, N) and treating them as indexes.

    Option a)

    You could generate such an index sample using the well-known sample method.

    random.sample(xrange(N), K)
    

    From the Python docs about random.sample:

    To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population
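
    In Python 3, xrange no longer exists and range itself is the lazy sequence, so the same call can be written as in the short sketch below. The values of N and K are illustrative assumptions.

```python
import random

N = 10**12  # population size (illustrative)
K = 5       # sample size (illustrative)

# range() is lazy, so no list of N elements is ever built
indexes = random.sample(range(N), K)
print(indexes)
```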

    Option b)

    If you don't like the fact that random.sample already returns a list instead of a lazy generator of non-repeating random numbers, you can go fancy with Format-Preserving Encryption to encrypt a counter.

    This way you get a real generator of random indexes, and you can pick as many as you want and stop at any time, without getting any duplicates, which gives you dynamically sized sample sets.

    The idea is to construct an encryption scheme that encrypts the numbers from 0 to N-1. Each time you want a sample from your population, you pick a random key for your encryption and start encrypting the numbers 0, 1, 2, ... onwards (this is the counter). Since every good encryption creates a random-looking 1:1 mapping, you end up with non-repeating random integers you can use as indexes. The storage requirement during this lazy generation is just the initial key plus the current value of the counter.

    The idea was already discussed in Generating non-repeating random numbers in Python. There even is a python snippet linked: formatpreservingencryption.py

    A sample code using this snippet could be implemented like this:

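    import random
    import string
    # FPEInteger is provided by the linked formatpreservingencryption.py snippet
    
    def itersample(population):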
        # Get the size of the population
        N = len(population)
        # Get the number of bits needed to represent this number
        bits = (N-1).bit_length()
        # Generate some random key
        key = ''.join(random.choice(string.ascii_letters + string.digits) for _ in range(32))
        # Create a new crypto instance that encrypts binary blocks of width <bits>
        # Thus, being able to encrypt all numbers up to the nearest power of two
        crypter = FPEInteger(key=key, radix=2, width=bits)
    
        # Count up 
        for i in xrange(1<<bits):
            # Encrypt the current counter value
            x = crypter.encrypt(i)
            # If it is bigger than our population size, just skip it
            # Since we generate numbers up to the nearest power of 2, 
            # we have to skip up to half of them, and on average up to one at a time
            if x < N:
                # Return the randomly chosen element
                yield population[x]
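
    To show the encrypted-counter idea without the external snippet, here is a self-contained toy sketch. It is an assumption-laden stand-in, not FPEInteger: it builds a keyed one-to-one mapping from a small balanced Feistel network with SHA-256 as the round function, and it has not been vetted as real encryption. The names feistel_permute and permuted_indexes are hypothetical.

```python
import hashlib
import itertools
import os

def feistel_permute(value, key, half_bits, rounds=4):
    """Map value to another integer of 2*half_bits bits, one-to-one for a fixed key."""
    mask = (1 << half_bits) - 1
    left, right = value >> half_bits, value & mask
    for r in range(rounds):
        data = key + bytes([r]) + right.to_bytes(16, "big")
        f = int.from_bytes(hashlib.sha256(data).digest()[:8], "big") & mask
        # A Feistel round is invertible whatever the round function is
        left, right = right, left ^ f
    return (left << half_bits) | right

def permuted_indexes(n, key=None):
    """Yield every index in range(n) exactly once, in a key-dependent order."""
    if key is None:
        key = os.urandom(16)  # fresh random key for each sampling run
    # Use an even bit width so the block splits into two equal halves
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    for counter in range(1 << (2 * half_bits)):
        x = feistel_permute(counter, key, half_bits)
        if x < n:  # skip outputs that fall outside the population
            yield x

# Take a few indexes and stop; state is just the key and the counter
first_five = list(itertools.islice(permuted_indexes(10**9), 5))
print(first_five)
```

    As in the snippet-based version, outputs beyond the population size are skipped. Because the domain is rounded up to an even number of bits, it can be up to about four times the population size, so on average a few counter values may be skipped per returned index.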
    