Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 answers
  • 2020-11-27 03:16

    For example, if you have loan.csv, you can use this script to load a specified number of random rows. Note that pd.read_csv() still parses the whole file before .sample() picks the rows, so this only works when the file fits in memory.

    data = pd.read_csv('loan.csv').sample(10000, random_state=44)
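    If the file is too big to load at once, one workaround is the callable form of the skiprows= parameter, which pandas evaluates per row index, so the full file is never materialized in memory. This is only a sketch: keep_fraction is a hypothetical value you would tune to get roughly the sample size you need.

    import random
    import pandas as pd

    keep_fraction = 0.01  # hypothetical: fraction of rows to keep, tune per file
    # i == 0 is the header row, so it is never skipped
    data = pd.read_csv('loan.csv',
                       skiprows=lambda i: i > 0 and random.random() > keep_fraction)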
    
  • 2020-11-27 03:21

    No pandas!

    import random
    from os import fstat
    from sys import exit

    # Open in binary mode so that relative seeks work in Python 3
    f = open('/usr/share/dict/words', 'rb')

    # Number of lines to be read
    lines_to_read = 100

    # Minimum and maximum bytes that will be randomly skipped
    min_bytes_to_skip = 10000
    max_bytes_to_skip = 1000000

    def is_EOF():
        return f.tell() >= fstat(f.fileno()).st_size

    # To accumulate the read lines
    sampled_lines = []

    for n in range(lines_to_read):
        bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
        f.seek(bytes_to_skip, 1)
        # After skipping "bytes_to_skip" bytes, we may have stopped in the
        # middle of a line, so discard the rest of the current line
        f.readline()
        if not is_EOF():
            sampled_lines.append(f.readline().decode())
        else:
            # Go back to the beginning of the file ...
            f.seek(0, 0)
            # ... and skip bytes again
            f.seek(bytes_to_skip, 1)
            # If we have reached the EOF again
            if is_EOF():
                print("You have skipped more bytes than your file has")
                print("Reduce the values of:")
                print("   min_bytes_to_skip")
                print("   max_bytes_to_skip")
                exit(1)
            else:
                f.readline()
                sampled_lines.append(f.readline().decode())

    print(sampled_lines)
    

    You'll end up with a sampled_lines list. What kind of statistics do you mean?
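    If you mean per-column summary statistics, a minimal sketch (assuming the sampled lines are rows of a CSV file without a header; io.StringIO and describe() are just one way to do it):

    import io
    import pandas as pd

    # Parse the sampled lines as CSV rows into a DataFrame
    df = pd.read_csv(io.StringIO(''.join(sampled_lines)), header=None)
    # count, mean, std, min, quartiles and max for each numeric column
    print(df.describe())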

  • 2020-11-27 03:22

    Read the data file (note that, as above, this loads the entire file into memory):

    import pandas as pd
    df = pd.read_csv('data.csv')


    First, check the shape of df:

    df.shape


    Create a small sample of 1000 rows from df:

    sample_data = df.sample(n=1000, replace=False)


    Check the shape of sample_data:

    sample_data.shape
    
  • 2020-11-27 03:23

    Here is an algorithm, known as reservoir sampling, that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

    Say you want m samples. The algorithm keeps the first m lines. Then, when it sees the i-th line (i > m), it keeps that line with probability m/i, using it to replace a uniformly chosen line among the m already selected.

    By doing so, for any i > m, we always have a subset of m samples selected uniformly at random from the first i lines.

    See code below:

    import random

    n_samples = 10
    samples = []

    # Stream the file line by line ('big.csv' is a placeholder filename)
    with open('big.csv') as f:
        for i, line in enumerate(f):
            if i < n_samples:
                # Keep the first n_samples lines unconditionally
                samples.append(line)
            elif random.random() < n_samples / (i + 1):
                # Replace a uniformly chosen line already in the reservoir
                samples[random.randint(0, n_samples - 1)] = line
    
  • 2020-11-27 03:23

    TL;DR

    If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample from it with the following pandas code:

    import pandas as pd
    import numpy as np
    
    filename = "data.csv"
    sample_size = 10000
    batch_size = 200
    
    rng = np.random.default_rng()
    
    sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
    
    sample = sample_reader.get_chunk(sample_size)
    
    for chunk in sample_reader:
        chunk.index = rng.integers(sample_size, size=len(chunk))
        sample.loc[chunk.index] = chunk
    

    Explanation

    It's not always trivial to know the size of the input CSV file.

    If there are embedded line breaks, tools like wc or shuf will give you the wrong answer or just make a mess out of your data.

    So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.

    To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter:

    sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)
    

    First, we get the initial sample:

    sample = sample_reader.get_chunk(sample_size)
    

    Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk itself, where each integer falls within the index range of the initial sample (which happens to be the same as range(sample_size)):

    for chunk in sample_reader:
        chunk.index = rng.integers(sample_size, size=len(chunk))
    

    And use this reindexed chunk to replace (some of the) lines in the sample:

    sample.loc[chunk.index] = chunk
    

    After the for loop, you'll have a DataFrame at most sample_size rows long, with random lines selected from the big CSV file.

    To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).

    Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new index size instead of simply batch_size, because the last chunk in the loop could be smaller.

    On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though the file could contain fewer lines than sample_size. That is safe, because in that case there would be no chunks left to loop over.
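    One optional last step, if row order matters downstream: the rows that survive the loop keep the positions they had in the initial chunk, so you may want to shuffle the result. A one-line sketch:

    # Shuffle the sampled rows and discard the now-meaningless index
    sample = sample.sample(frac=1).reset_index(drop=True)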

  • 2020-11-27 03:24

    The following code first reads the header, and then a random sample of the other lines:

    import pandas as pd
    import numpy as np
    
    filename = 'hugedatafile.csv'
    nlinesfile = 10000000          # total number of data lines (must be known or estimated)
    nlinesrandomsample = 10000     # desired sample size
    # Choose which data lines to skip; row 0 (the header) is never skipped
    lines2skip = np.random.choice(np.arange(1, nlinesfile + 1),
                                  nlinesfile - nlinesrandomsample,
                                  replace=False)
    df = pd.read_csv(filename, skiprows=lines2skip)
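    This snippet assumes you already know nlinesfile. If you don't, here is a minimal sketch of a line counter (count_lines is a hypothetical helper) that streams the file instead of loading it; note that embedded line breaks inside quoted fields will be miscounted:

    def count_lines(path, chunk_bytes=1 << 20):
        """Count newline characters by streaming the file in 1 MiB chunks."""
        n = 0
        with open(path, 'rb') as fh:
            while chunk := fh.read(chunk_bytes):
                n += chunk.count(b'\n')
        return n

    nlinesfile = count_lines('hugedatafile.csv')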
    