The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
For example, if you have loan.csv, you can use this snippet to load a specified number of random rows:
import pandas as pd

data = pd.read_csv('loan.csv').sample(n=10000, random_state=44)
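Once the sample is in memory, the simple statistics part is straightforward; for example (the column name 'loan_amnt' below is only a placeholder for whatever numeric column your file actually has):
# summary statistics for every numeric column of the sampled frame
print(data.describe())

# or a couple of figures for one (hypothetical) column
print(data['loan_amnt'].mean(), data['loan_amnt'].median())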
No pandas!
import random
from os import fstat
from sys import exit

# Open in binary mode so relative seeks are allowed (Python 3 text files
# only support absolute seeks)
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes we may have stopped in the middle
    # of a line, so discard the rest of the current line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more bytes than your file has")
            print("Reduce the values of:")
            print("  min_bytes_to_skip")
            print("  max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)
You'll end up with a sampled_lines list. What kind of statistics do you mean?
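If the file you sampled is actually a CSV (rather than the word list used above), you can get simple statistics out of sampled_lines with just the standard library. A rough sketch, assuming the lines were read in binary mode as in the snippet above, are plain comma-separated records, and have a numeric third field (all of these are assumptions you'll need to adapt):
import csv
import statistics

# decode the sampled raw lines and parse them as CSV rows
rows = csv.reader(line.decode('utf-8') for line in sampled_lines)

# pick the third field of each row as a numeric value (hypothetical layout)
values = [float(row[2]) for row in rows if len(row) > 2]

print("count:", len(values))
print("mean: ", statistics.mean(values))
print("stdev:", statistics.stdev(values))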
import pandas as pd

df = pd.read_csv('data.csv')
# check the shape of the full dataframe
print(df.shape)

# draw 1000 rows without replacement
sample_data = df.sample(n=1000, replace=False)
# check the shape of sample_data
print(sample_data.shape)
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. The algorithm keeps the first m lines it reads. Then, when it sees the i-th line (i > m), with probability m/i it uses that line to replace a uniformly chosen one of the already selected samples.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random

n_samples = 10
samples = []

with open('data.csv') as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            # replace a uniformly chosen element of the reservoir
            samples[random.randint(0, n_samples - 1)] = line
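If your CSV has a header row, a small variation of the same reservoir loop can keep the header aside and then hand the sampled lines to pandas. A sketch, still assuming the file is called data.csv:
import io
import random

import pandas as pd

n_samples = 10
samples = []

with open('data.csv') as f:
    header = next(f)  # keep the header out of the reservoir
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line

# rebuild a small CSV in memory and let pandas parse it
df = pd.read_csv(io.StringIO(header + ''.join(samples)))
print(df.describe())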
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
It's not always trivial to know the size of the input CSV file. If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
    sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case, so that will never be a problem.
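Because the chunks were read with dtype=str, every column in sample is a string, so convert the columns you need before computing statistics. A short follow-up sketch (the column name 'amount' is purely a placeholder):
# 'amount' is a hypothetical column name; use one from your own file
sample['amount'] = pd.to_numeric(sample['amount'], errors='coerce')
print(sample['amount'].describe())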
The following code reads the header first, and then a random sample of the other lines; note that it assumes you already know the total number of lines in the file:
import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000           # total number of data lines in the file
nlinesrandomsample = 10000      # how many lines to keep

# choose which line numbers to skip (line 0 is the header and is never skipped)
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1),
                              (nlinesfile - nlinesrandomsample),
                              replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
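If you don't know nlinesfile beforehand, you can obtain it by streaming over the file once, which never holds more than one line in memory. A minimal sketch (the simple count below assumes no quoted fields with embedded newlines):
# count data lines (excluding the header) without loading the file into memory;
# this naive count is wrong if quoted fields contain embedded newlines
with open('hugedatafile.csv') as f:
    nlinesfile = sum(1 for _ in f) - 1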