How to stream in and manipulate a large data file in Python

庸人自扰 2021-02-04 15:20

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1                 


        
2 Answers
  • 2021-02-04 15:44

    You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

    import dask.dataframe as dd
    
    df = dd.read_csv('my_file.csv')
    df = df.groupby('Geography')['Count'].sum().to_frame()
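    # single_file=True writes one CSV; by default dask writes one file per partition.
    df.to_csv('my_output.csv', single_file=True)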
    

    Alternatively, if pandas is a requirement you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

    import pandas as pd
    
    # Aggregate each chunk on its own, so only one chunk is in memory at a time.
    data = []
    for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
        chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
        data.append(chunk)
    
    # Combine the per-chunk sums and aggregate again,
    # because the same Geography can appear in several chunks.
    df = pd.concat(data, ignore_index=True)
    df = df.groupby('Geography')['Count'].sum().to_frame()
    df.to_csv('my_output.csv')
    
  • 2021-02-04 15:57

    I like @root's solution, but I would go a bit further in optimizing memory usage: keep only the aggregated DataFrame in memory, and read only the columns you really need:

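    import pandas as pd
    
    filename = 'my_file.csv'   # placeholder path, adjust to your file
    cols = ['Geography','Count']
    df = pd.DataFrame()
    
    chunksize = 2   # tiny value for the test data below; adjust it, for example to 10**5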
    for chunk in (pd.read_csv(filename,
                              usecols=cols,
                              chunksize=chunksize)
                 ):
        # merge previously aggregated DF with a new portion of data and aggregate it again
        df = (pd.concat([df,
                         chunk.groupby('Geography')['Count'].sum().to_frame()])
                .groupby(level=0)['Count']
                .sum()
                .to_frame()
             )
    
    df.reset_index().to_csv('c:/temp/result.csv', index=False)
    

    test data:

    Geography,AgeGroup,Gender,Race,Count
    County1,1,M,1,12
    County2,2,M,1,3
    County3,2,M,2,0
    County1,1,M,1,12
    County2,2,M,1,33
    County3,2,M,2,11
    County1,1,M,1,12
    County2,2,M,1,111
    County3,2,M,2,1111
    County5,1,M,1,12
    County6,2,M,1,33
    County7,2,M,2,11
    County5,1,M,1,12
    County8,2,M,1,111
    County9,2,M,2,1111
    

    result.csv:

    Geography,Count
    County1,36
    County2,147
    County3,1122
    County5,24
    County6,33
    County7,11
    County8,111
    County9,1111
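
A quick way to check the output above is to run the same kind of chunked aggregation on the test data held in memory. This is a minimal sketch, not the answer's exact code: the function name `sum_counts_in_chunks` and the use of `io.StringIO` in place of a real file are assumptions for illustration, and it keeps a running Series updated with `Series.add(fill_value=0)` instead of repeated `pd.concat` calls.

```python
import io

import pandas as pd

# The test data from this answer, held in memory so the sketch is self-contained.
TEST_CSV = """Geography,AgeGroup,Gender,Race,Count
County1,1,M,1,12
County2,2,M,1,3
County3,2,M,2,0
County1,1,M,1,12
County2,2,M,1,33
County3,2,M,2,11
County1,1,M,1,12
County2,2,M,1,111
County3,2,M,2,1111
County5,1,M,1,12
County6,2,M,1,33
County7,2,M,2,11
County5,1,M,1,12
County8,2,M,1,111
County9,2,M,2,1111
"""


def sum_counts_in_chunks(source, chunksize=4):
    """Sum Count per Geography, holding one chunk plus the running totals in memory."""
    totals = None
    for chunk in pd.read_csv(source, usecols=["Geography", "Count"], chunksize=chunksize):
        partial = chunk.groupby("Geography")["Count"].sum()
        # Align on Geography; a group missing on either side counts as 0.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    if totals is None:
        # No data rows at all: return an empty result instead of None.
        return pd.Series(dtype="int64")
    # fill_value alignment can promote the sums to float, so cast back to integers.
    return totals.astype("int64").sort_index()


totals = sum_counts_in_chunks(io.StringIO(TEST_CSV))
print(totals.to_dict())
```

For a real file, pass its path instead of the `StringIO` object and raise `chunksize` (for example to `10**5`).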
    

    PS: with this approach you can process huge files, because only one chunk and the aggregated DataFrame are held in memory at any time.
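
The same idea, keeping only running totals in memory, also works without pandas. The sketch below swaps in the standard-library `csv` module and reads one row at a time. The function name `sum_counts_streaming` and the three-row sample are hypothetical, and it assumes a comma-separated file with a header row and integer `Count` values.

```python
import csv
import io
from collections import defaultdict


def sum_counts_streaming(lines, key="Geography", value="Count"):
    """Sum `value` per `key`, reading one CSV row at a time."""
    totals = defaultdict(int)
    for row in csv.DictReader(lines):
        # Only the per-key totals stay in memory, never the whole file.
        totals[row[key]] += int(row[value])
    return dict(totals)


# Hypothetical three-row sample; for a real file use open(path, newline="").
sample = io.StringIO(
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,12\n"
    "County2,2,M,1,3\n"
    "County1,1,M,1,12\n"
)
streamed = sum_counts_streaming(sample)
print(streamed)
```

Memory use here grows with the number of distinct keys rather than with the file size, which is the same property the chunked pandas version relies on.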

    PPS: the chunking approach should work unless you need to sort your data. In that case I would use classic UNIX tools such as awk and sort to sort the data first.

    I would also recommend using PyTables (HDF5 storage) instead of CSV files. It is very fast and lets you read data conditionally (using the where parameter), which saves a lot of resources and is usually much faster than reading CSV.
