The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?
For example, if you have loan.csv, you can use this snippet to load a specified number of random rows:
import pandas as pd

data = pd.read_csv('loan.csv').sample(n=10000, random_state=44)
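Once the sample is in memory, the simple statistics part is straightforward; for example (the column name 'loan_amnt' below is only a placeholder for whatever numeric column your file actually has):
# summary statistics for every numeric column of the sampled frame
print(data.describe())

# or a couple of figures for one (hypothetical) column
print(data['loan_amnt'].mean(), data['loan_amnt'].median())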
No pandas!
import random
from os import fstat
from sys import exit

# Open in binary mode so relative seeks are allowed (Python 3 text files
# only support absolute seeks)
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes we may have stopped in the middle
    # of a line, so discard the rest of the current line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip bytes again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more bytes than your file has")
            print("Reduce the values of:")
            print("  min_bytes_to_skip")
            print("  max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline())

print(sampled_lines)
You'll end up with a sampled_lines list. What kind of statistics do you mean?
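If the file you sampled is actually a CSV (rather than the word list used above), you can get simple statistics out of sampled_lines with just the standard library. A rough sketch, assuming the lines were read in binary mode as in the snippet above, are plain comma-separated records, and have a numeric third field (all of these are assumptions you'll need to adapt):
import csv
import statistics

# decode the sampled raw lines and parse them as CSV rows
rows = csv.reader(line.decode('utf-8') for line in sampled_lines)

# pick the third field of each row as a numeric value (hypothetical layout)
values = [float(row[2]) for row in rows if len(row) > 2]

print("count:", len(values))
print("mean: ", statistics.mean(values))
print("stdev:", statistics.stdev(values))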
import pandas as pd

df = pd.read_csv('data.csv')
# check the shape of the full dataframe
print(df.shape)

# draw 1000 rows without replacement
sample_data = df.sample(n=1000, replace=False)
# check the shape of sample_data
print(sample_data.shape)
Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.
Say you want m samples. The algorithm keeps the first m lines it reads. Then, when it sees the i-th line (i > m), with probability m/i it uses that line to replace a uniformly chosen one of the already selected samples.
By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.
See code below:
import random

n_samples = 10
samples = []

with open('data.csv') as f:
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            # replace a uniformly chosen element of the reservoir
            samples[random.randint(0, n_samples - 1)] = line
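If your CSV has a header row, a small variation of the same reservoir loop can keep the header aside and then hand the sampled lines to pandas. A sketch, still assuming the file is called data.csv:
import io
import random

import pandas as pd

n_samples = 10
samples = []

with open('data.csv') as f:
    header = next(f)  # keep the header out of the reservoir
    for i, line in enumerate(f):
        if i < n_samples:
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            samples[random.randint(0, n_samples - 1)] = line

# rebuild a small CSV in memory and let pandas parse it
df = pd.read_csv(io.StringIO(header + ''.join(samples)))
print(df.describe())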
If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code:
import pandas as pd
import numpy as np
filename = "data.csv"
sample_size = 10000
batch_size = 200
rng = np.random.default_rng()
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
sample = sample_reader.get_chunk(sample_size)
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk
It's not always trivial to know the size of the input CSV file. If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.
So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.
To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:
sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
First, we get the initial sample:
sample = sample_reader.get_chunk(sample_size)
Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the size of the chunk, but where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):
for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
And use this reindexed chunk to replace (some of the) lines in the sample:
    sample.loc[chunk.index] = chunk
After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.
To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).
Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.
On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is because there won't be any chunks left to loop over in this case, so that will never be a problem.
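Because the chunks were read with dtype=str, every column in sample is a string, so convert the columns you need before computing statistics. A short follow-up sketch (the column name 'amount' is purely a placeholder):
# 'amount' is a hypothetical column name; use one from your own file
sample['amount'] = pd.to_numeric(sample['amount'], errors='coerce')
print(sample['amount'].describe())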
The following code reads the header first, and then a random sample of the other lines; note that it assumes you already know the total number of lines in the file:
import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000           # total number of data lines in the file
nlinesrandomsample = 10000      # how many lines to keep

# choose which line numbers to skip (line 0 is the header and is never skipped)
lines2skip = np.random.choice(np.arange(1, nlinesfile + 1),
                              (nlinesfile - nlinesrandomsample),
                              replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)
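If you don't know nlinesfile beforehand, you can obtain it by streaming over the file once, which never holds more than one line in memory. A minimal sketch (the simple count below assumes no quoted fields with embedded newlines):
# count data lines (excluding the header) without loading the file into memory;
# this naive count is wrong if quoted fields contain embedded newlines
with open('hugedatafile.csv') as f:
    nlinesfile = sum(1 for _ in f) - 1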