shuffle a large list of items without loading in memory

后端 未结 6 1851
礼貌的吻别
礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200gigs). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can\'t hold all the dat

6条回答
  •  借酒劲吻你
    2021-01-07 21:16

    You could make an iterator that gives permutations. You offset your read into a file by the amount it gives. Because the iterator gives permutations, you will never read the same data twice.

    All the permutations of a set of N elements can be generated by transpositions, which are permutations that swap the 0th and the ith element (assuming indexing from 0) and leave all other elements in their place. So you can make a random permutation by composing some randomly chosen transpositions. Here's an example written in Python:

    import random
    
    class Transposer:
        def __init__(self,i):
            """
            (Indexes start at 0)
            Swap 0th index and ith index, otherwise identity mapping.
            """
            self.i = i
        def map(self,x):
            if x == 0:
                return self.i
            if x == self.i:
                return 0
            return x
    
    class RandomPermuter:
        def __init__(self,n_gens,n):
            """
            Picks n_gens integers in [0,n) to make transposers that, when composed,
            form a permutation of a set of n elements. Of course if there are an even number of drawn
            integers that are equal, they cancel each other out. We could keep
            drawing numbers until we have n_gens unique numbers... but we don't for
            this demo.
            """
            gen_is = [random.randint(0,n-1) for _ in range(n_gens)]
            self.trans = [Transposer(g) for g in gen_is]
        def map(self,x):
            for t in self.trans:
                x = t.map(x)
            return x
    
    rp = RandomPermuter(10,10)
    
    # Use these numbers to seek into a file
    print(*[rp.map(x) for x in range(10)])
    

提交回复
热议问题