Shuffle a large list of items without loading it into memory


礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can't hold all the dat

6 answers

借酒劲吻你 (original poster)

2021-01-07 21:16

You could make an iterator that yields a permutation of the line indices, and use each value it yields as the position to read from in the file. Because the mapping is a permutation, you will never read the same data twice.

Every permutation of a set of N elements can be written as a composition of transpositions of the form (0 i): permutations that swap the 0th and the ith element (indexing from 0) and leave all other elements in place. So you can make a random permutation by composing some randomly chosen transpositions of this kind. Note that composing k of them moves at most k + 1 elements, so k has to be large relative to N before the result looks well mixed. Here's an example written in Python:

import random

class Transposer:
    def __init__(self, i):
        """
        (Indexes start at 0)
        Swap 0th index and ith index, otherwise identity mapping.
        """
        self.i = i

    def map(self, x):
        if x == 0:
            return self.i
        if x == self.i:
            return 0
        return x


class RandomPermuter:
    def __init__(self, n_gens, n):
        """
        Picks n_gens integers in [0, n) to make transposers that, when composed,
        form a permutation of a set of n elements. If the same integer is drawn
        twice in a row, the two transposers cancel each other out, and a draw
        of 0 is the identity. We could keep drawing numbers until we have
        n_gens unique numbers... but we don't for this demo.
        """
        gen_is = [random.randint(0, n - 1) for _ in range(n_gens)]
        self.trans = [Transposer(g) for g in gen_is]

    def map(self, x):
        # Apply each transposer in turn; the composition is still a bijection.
        for t in self.trans:
            x = t.map(x)
        return x


rp = RandomPermuter(10, 10)

# Use these numbers to seek into a file
print(*[rp.map(x) for x in range(10)])
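
The comment above says to use the permuted numbers to seek into a file, but a line number is not a byte offset, so some index of where each line starts is needed first. Below is a minimal sketch of one way that could look. The helper names `build_line_offsets`, `shuffled_lines` and `star_transpositions` are hypothetical and not part of the answer above, and the sketch assumes every line ends with a newline and that the offset index fits in memory.

```python
import random
from array import array


def build_line_offsets(path):
    """Return an array holding the byte offset at which each line starts."""
    offsets = array("Q")  # unsigned 64-bit integers, 8 bytes per line
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets


def shuffled_lines(path, offsets, permute):
    """Yield every line of the file exactly once, in the order permute gives."""
    with open(path, "rb") as f:
        for i in range(len(offsets)):
            f.seek(offsets[permute(i)])
            yield f.readline()


def star_transpositions(n_gens, n, rng=random):
    """Compose n_gens random swaps of index 0 with index t into one mapping."""
    targets = [rng.randrange(n) for _ in range(n_gens)]

    def permute(x):
        # Same rule as Transposer.map, applied once per drawn target.
        for t in targets:
            if x == 0:
                x = t
            elif x == t:
                x = 0
        return x

    return permute
```

At 8 bytes per entry, the offset index for ~2 billion lines would itself take roughly 16 GB, so for a file of that size it would probably have to be stored on disk or memory-mapped rather than kept in an in-memory array.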
