I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t
You can use a variation of the probabilistic method for choosing a random line in a file.
Instead of keeping just a single chosen line, you can keep a buffer of size C. For each line number n, in a file with N lines, you choose that line with probability C/n (rather than the original 1/n). If the line is selected, you then choose a random location in the C-length buffer to evict.
Here's how it works:
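import random

C = 2  # number of lines to sample
fpath = 'somelines.txt'
buffer = []

with open(fpath, 'r') as f:
    for line_num, line in enumerate(f):
        n = line_num + 1.0
        r = random.random()
        if n <= C:
            # The first C lines fill the buffer
            buffer.append(line.strip())
        elif r < C / n:
            # Keep line n with probability C/n, evicting a random entry
            loc = random.randint(0, C - 1)
            buffer[loc] = line.strip()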
This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file, provided the file has at least C lines. Each line has probability C/N of being selected.
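To see where the C/N figure comes from, here is a short derivation sketch. It assumes N >= C and that the slot to evict is chosen uniformly, as in the code above; m is just an index over the later line numbers.

```latex
% Line n > C enters the buffer with probability C/n.
% At each later line m it is evicted with probability (C/m)(1/C) = 1/m,
% so it survives that step with probability (m-1)/m.
\begin{align*}
P(\text{line } n \text{ in final buffer})
  &= \frac{C}{n} \prod_{m=n+1}^{N} \frac{m-1}{m}
   = \frac{C}{n} \cdot \frac{n}{N}
   = \frac{C}{N}, \qquad n > C, \\
P(\text{line } n \text{ in final buffer})
  &= 1 \cdot \prod_{m=C+1}^{N} \frac{m-1}{m}
   = \frac{C}{N}, \qquad n \le C.
\end{align*}
```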
To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce about an even distribution of the 5 choose 2 (so 10) possible choices. The results:
a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947
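A check like this can be reproduced with a short harness. This is a minimal sketch, not the exact script that produced the numbers above: the `reservoir_sample` helper and the in-memory list standing in for the five-line file are assumptions made for illustration.

```python
import random
from collections import Counter


def reservoir_sample(lines, c, rng=random):
    """Return c items chosen uniformly from an iterable in one pass."""
    buffer = []
    for n, line in enumerate(lines, start=1):
        if n <= c:
            # The first c items fill the buffer
            buffer.append(line)
        elif rng.random() < c / n:
            # Keep item n with probability c/n, evicting a random entry
            buffer[rng.randrange(c)] = line
    return buffer


# Stand-in for the five-line file containing a, b, c, d, e
lines = ['a', 'b', 'c', 'd', 'e']

counts = Counter()
for _ in range(10000):
    pair = tuple(sorted(reservoir_sample(lines, 2)))
    counts[pair] += 1

# Each of the 10 possible pairs should appear roughly 1000 times
for pair, count in counts.most_common():
    print(','.join(pair), count)
```

Sorting each pair before counting treats a,b and b,a as the same outcome, which matches how the results above are tallied.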
I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it
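Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can retrieve your random subset by selecting each line independently with a fixed probability. (Strictly speaking this is Bernoulli sampling rather than reservoir sampling, which does not need the line count.) Simply iterate through your data and, for each line, decide at random whether it is selected. That way you only need a single pass over your data, and the result will contain approximately, not exactly, the desired number of lines.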
This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).
import csv
from random import random

chances_selected = desired_num_results / float(total_entries)
result = []
for line in csv.reader(file):  # file: an already-open CSV file object
    if random() < chances_selected:
        result.append(line)
Another solution is possible if you know the total number of lines: generate 1 million distinct random line numbers below the total number of lines n and store them as a set, e.g. lines_to_grab = set(random.sample(range(n), 1000000)) (use xrange on Python 2), then use:
def grab_lines(csvfile, lines_to_grab):
    # lines_to_grab: set of zero-based line numbers to keep
    for i, line in enumerate(csvfile):
        if i in lines_to_grab:
            yield line
This will get you exactly 1 million lines in an unbiased way, but you need to have the number of lines beforehand.
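If the line count is not known yet, it can be obtained with one extra sequential pass. The following is a minimal sketch of that, combined with the exact-size selection; the function names `count_lines` and `sample_exact` and their parameters are hypothetical, and it assumes one record per line (no quoted fields containing newlines).

```python
import random


def count_lines(path):
    """Count lines with one sequential pass, without loading the file."""
    with open(path, 'r') as f:
        return sum(1 for _ in f)


def sample_exact(path, k, rng=random):
    """Yield exactly k lines chosen uniformly at random, in file order."""
    n = count_lines(path)
    # Set membership tests are O(1), so the second pass stays linear
    lines_to_grab = set(rng.sample(range(n), k))
    with open(path, 'r') as f:
        for i, line in enumerate(f):
            if i in lines_to_grab:
                yield line
```

The cost is two full reads of the file, which may still be acceptable for a 15 GB file if the count can be cached while the dataset stays unchanged.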
You can rewrite the file with fixed-length records, and then perform random access on the intermediate file later:
RECORD_LEN = 15  # padding width; must be at least the length of the longest line
with open("inputfile.csv") as ifile, open("intermediatefile.csv", "w") as ofile:
    for line in ifile:
        ofile.write(line.rstrip('\n').ljust(RECORD_LEN) + '\n')
Then, you can do:
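import random

RECORD_LEN = 15  # same width used to pad the records (15)
# nlines: total number of records; k: how many to sample (both assumed known)
samples = random.sample(range(nlines), k)
lines = []
# Binary mode so that seek() offsets are plain byte positions;
# assumes single-byte characters and '\n' line endings
with open("intermediatefile.csv", "rb") as ifile:
    for sample in samples:
        # Each record occupies RECORD_LEN bytes plus the newline
        ifile.seek(sample * (RECORD_LEN + 1))
        lines.append(ifile.readline())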
This requires more disk space, and the first program may take some time to run, but the second program then allows unlimited random access to the records.
In this method, we generate a set of random row numbers: as many as there are lines to be read, drawn from the range of row numbers present in the data. The numbers are then sorted from smallest to largest and stored.
The csv file is then read line by line, with a line_counter keeping track of the current row number. This line_counter is compared with the first element of the sorted list; if they are the same, that line is written to the new csv file and the first element is removed from the list, so that the previously second element takes its place. The cycle continues until the list is empty.
import csv
import random
from collections import deque

k = random.sample(range(No_of_rows_in_data), No_of_lines_to_be_read)
Num = deque(sorted(k))  # a deque removes its first element in O(1)

with open(input_file, newline='') as file_handle:
    reader = csv.reader(file_handle)
    with open(output_file, 'w', newline='') as outfile:
        a = csv.writer(outfile)
        # line_counter is zero-based, matching the sampled row numbers
        for line_counter, line in enumerate(reader):
            if not Num:
                break
            if line_counter == Num[0]:
                a.writerow(line)
                Num.popleft()