I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same lines of text, but shuffled randomly by line. I can't hold all the data in memory.
You could make an iterator that yields a permutation of the line indices, and offset each read into the file by the index it yields. Because it yields a permutation, you will never read the same line twice.
Every permutation of a set of N elements can be generated by composing transpositions that swap the 0th and the ith element (indexing from 0) and leave all other elements in place. So you can build a random permutation by composing some randomly chosen transpositions of that form. Here's an example written in Python:
import random

class Transposer:
    def __init__(self, i):
        """
        (Indexes start at 0.)
        Swaps the 0th index and the ith index; maps every other index to itself.
        """
        self.i = i

    def map(self, x):
        if x == 0:
            return self.i
        if x == self.i:
            return 0
        return x

class RandomPermuter:
    def __init__(self, n_gens, n):
        """
        Picks n_gens integers in [0, n) to make transposers that, when composed,
        form a permutation of a set of n elements. Note that two adjacent
        transposers built from the same integer cancel each other out. We could
        keep drawing numbers until we have n_gens distinct ones... but we don't
        for this demo.
        """
        gen_is = [random.randint(0, n - 1) for _ in range(n_gens)]
        self.trans = [Transposer(g) for g in gen_is]

    def map(self, x):
        # Apply each transposition in turn; the composition is itself a permutation.
        for t in self.trans:
            x = t.map(x)
        return x

rp = RandomPermuter(10, 10)
# Use these numbers to seek into a file
print(*[rp.map(x) for x in range(10)])
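To turn those permuted indices into actual reads, you need a way to translate a line index into a byte position. One way, assuming you can afford an extra pass over the file, is to record each line's starting byte offset and then seek to those offsets in permuted order. A minimal sketch along those lines; the shuffle_file name and n_draws parameter are mine, and it assumes the offset list fits in memory (roughly 16 GB of offsets for 2 billion lines, so it may itself need to live on disk):

def shuffle_file(in_path, out_path, n_draws):
    """Write the lines of in_path to out_path in permuted order (sketch)."""
    # First pass: record the byte offset at which each line starts.
    offsets = []
    with open(in_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Build a random permutation of the line indices using the classes above.
    rp = RandomPermuter(n_draws, len(offsets))

    # Second pass: for each output position x, seek to line rp.map(x) and copy it.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for x in range(len(offsets)):
            src.seek(offsets[rp.map(x)])
            dst.write(src.readline())

Note that rp.map(x) walks the whole list of transpositions for every line, so with many draws the per-line cost adds up; this is a sketch of the idea rather than a tuned implementation.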