Read a small random sample from a big CSV file into a Python data frame

暖寄归人 2020-11-27 02:37

The CSV file that I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and do some simple statistics on the selected data frame?

13 Answers
  • 2020-11-27 03:25

    This is not in Pandas, but it achieves the same result much faster through bash, while not reading the entire file into memory:

    shuf -n 100000 data/original.tsv > data/sample.tsv
    

    The shuf command shuffles its input, and the -n argument indicates how many lines we want in the output.

    Relevant question: https://unix.stackexchange.com/q/108581

    Benchmark on a 7M-line CSV available here (2008):

    Top answer:

    import random
    import pandas

    def pd_read():
        filename = "2008.csv"
        n = sum(1 for line in open(filename)) - 1  # number of records in file (excludes header)
        s = 100000  # desired sample size
        skip = sorted(random.sample(range(1, n + 1), n - s))  # the 0-indexed header is never skipped
        df = pandas.read_csv(filename, skiprows=skip)
        df.to_csv("temp.csv")
    

    Timing for pandas:

    %time pd_read()
    CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
    Wall time: 18.9 s
    

    While using shuf:

    time shuf -n 100000 2008.csv > temp.csv
    
    real    0m1.583s
    user    0m1.445s
    sys     0m0.136s
    

    So shuf is about 12x faster and importantly does not read the whole file into memory.
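    If shuf is not available (say, outside of GNU coreutils), the same one-pass, bounded-memory behaviour can be sketched in pure Python with reservoir sampling. This is only a sketch of the idea, and the demo filename is invented:

```python
import random

def reservoir_sample(path, k, seed=None):
    """Return k uniformly chosen lines from path in a single pass,
    holding at most k lines in memory (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)  # fill the reservoir first
            else:
                j = rng.randint(0, i)   # new line survives with probability k/(i+1)
                if j < k:
                    reservoir[j] = line
    return reservoir

# Tiny demo file standing in for the big CSV.
with open("demo.csv", "w") as f:
    f.writelines(f"{i},x\n" for i in range(1000))

sample = reservoir_sample("demo.csv", 10, seed=0)
```

    The sampled lines can then be handed to pandas via `io.StringIO("".join(sample))` if a data frame is needed.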

  • 2020-11-27 03:33

    Assuming no header in the CSV file:

    import pandas
    import random
    
    n = 1000000 #number of records in file
    s = 10000 #desired sample size
    filename = "data.txt"
    skip = sorted(random.sample(range(n),n-s))
    df = pandas.read_csv(filename, skiprows=skip)
    

    It would be better if read_csv had a keeprows option, or if skiprows took a callback function instead of a list.

    With header and unknown file length:

    import pandas
    import random
    
    filename = "data.txt"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 10000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    
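    The header-aware variant can be sanity-checked end to end on a small synthetic file; the filename and sizes below are made up for the demonstration:

```python
import random
import pandas

# Build a toy CSV with a header and 1000 data rows (stand-in for the big file).
filename = "toy.csv"
with open(filename, "w") as f:
    f.write("value\n")
    f.writelines(f"{i}\n" for i in range(1000))

n = sum(1 for line in open(filename)) - 1  # number of data rows (excludes header)
s = 50                                     # desired sample size
skip = sorted(random.sample(range(1, n + 1), n - s))  # row 0 (header) is never skipped
df = pandas.read_csv(filename, skiprows=skip)
```

    Exactly s data rows survive, in their original file order.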
  • 2020-11-27 03:35
    from random import randint

    class magic_checker:
        """Sentinel object: compares equal once target_count comparisons have been made."""
        def __init__(self, target_count):
            self.target = target_count
            self.count = 0
        def __eq__(self, x):
            self.count += 1
            return self.count >= self.target

    min_target = 100000
    max_target = min_target * 2
    nlines = randint(100, 1000)                     # how many consecutive lines to read
    seek_target = randint(min_target, max_target)   # random byte offset to start from
    with open("big.csv") as f:
        f.seek(seek_target)
        f.readline()  # discard the (likely partial) line at the seek point
        rand_lines = list(iter(lambda: f.readline(), magic_checker(nlines)))

    # do something to process the lines you got returned .. perhaps just a split
    print(rand_lines)
    print(rand_lines[0].split(","))

    Something like that should work, I think.

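    The sentinel trick above can be written more plainly with itertools.islice. This sketch implements the same idea (read a run of lines starting at a random byte offset); the demo filename is invented, and note that it samples a contiguous chunk, not independently random lines:

```python
import os
import random
from itertools import islice

def sample_chunk(path, nlines):
    """Read up to nlines consecutive lines starting at a random byte offset.
    Returns fewer lines if the offset lands near the end of the file."""
    with open(path) as f:
        f.seek(random.randrange(os.path.getsize(path)))
        f.readline()  # discard the (likely partial) line at the seek point
        return list(islice(f, nlines))

# Tiny demo file standing in for big.csv.
with open("chunk_demo.csv", "w") as f:
    f.writelines(f"{i},x\n" for i in range(1000))

random.seed(3)
chunk = sample_chunk("chunk_demo.csv", 5)
```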
  • 2020-11-27 03:35

    Use the subsample package:

    pip install subsample
    subsample -n 1000 file.csv > file_1000_sample.csv
    
  • 2020-11-27 03:35

    You can also create a sample of the 10000 records before bringing it into the Python environment.

    Using Git Bash (Windows 10), I ran the following command to produce the sample:

    shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv
    

    Note: if your CSV has a header row, this command alone will not preserve it.
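    One workaround for the header case, assuming GNU head, tail, and shuf are available, is to copy the header first and sample only the data rows. The filenames below are placeholders, with a toy file standing in for the real input:

```shell
# Toy input standing in for BIGFILE.csv (header + 5 data rows).
printf 'id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n' > BIGFILE.csv

head -n 1 BIGFILE.csv > SAMPLEFILE.csv                 # copy the header row
tail -n +2 BIGFILE.csv | shuf -n 3 >> SAMPLEFILE.csv   # sample 3 of the data rows
```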

  • 2020-11-27 03:37

    @dlm's answer is great, but since v0.20.0 skiprows does accept a callable. The callable receives the row number as its argument.

    If you can specify what percent of lines you want, rather than how many lines, you don't even need to get the file size and you just need to read through the file once. Assuming a header on the first row:

    import pandas as pd
    import random
    p = 0.01  # 1% of the lines
    # keep the header, then take only 1% of lines
    # if random from [0,1] interval is greater than 0.01 the row will be skipped
    df = pd.read_csv(
             filename,
             header=0, 
             skiprows=lambda i: i>0 and random.random() > p
    )
    

    Or, if you want to take every nth line:

    n = 100  # every 100th line = 1% of the lines
    df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
    
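    The every-nth variant is deterministic, so it is easy to check on a synthetic file; the filename and sizes below are invented for the demonstration:

```python
import pandas as pd

# Toy file: a header line plus 1000 data rows (values 0..999).
filename = "toy_nth.csv"
with open(filename, "w") as f:
    f.write("value\n")
    f.writelines(f"{i}\n" for i in range(1000))

n = 100  # keep every 100th physical line; row 0 is the header and is always kept
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
```

    Rows 0, 100, 200, ..., 1000 are kept: the header plus 10 data rows holding the values 99, 199, ..., 999.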