How to stream in and manipulate a large data file in Python

庸人自扰 2021-02-04 15:20

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1                 


        
2 Answers
  •  清酒与你
    2021-02-04 15:44

    You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

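    import dask.dataframe as dd
    
    # Lazily read the file in partitions instead of loading it all at once.
    df = dd.read_csv('my_file.csv')
    # Sum Count within each Geography; to_frame() turns the Series into a DataFrame.
    df = df.groupby('Geography')['Count'].sum().to_frame()
    # single_file=True writes one CSV; without it dask writes one file per partition.
    df.to_csv('my_output.csv', single_file=True)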
    

    Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter: larger chunks mean fewer passes through the loop but more memory per chunk.

    import pandas as pd
    
    # Operate on chunks: reduce each chunk to per-Geography subtotals.
    data = []
    for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
        chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
        data.append(chunk)
    
    # Combine the chunked data and sum the subtotals into final totals.
    df = pd.concat(data, ignore_index=True)
    df = df.groupby('Geography')['Count'].sum().to_frame()
    df.to_csv('my_output.csv')
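
    If even the list of per-chunk subtotals is more than you want to hold, a variation is to keep a single running total and fold each chunk into it. The following is only a sketch: the function name sum_counts_in_chunks is hypothetical, the sample rows are made up, and it assumes a comma-separated file with the Geography and Count columns from the question.

```python
import io

import pandas as pd


def sum_counts_in_chunks(source, key="Geography", value="Count", chunksize=10**5):
    """Return per-key totals of `value`, reading `source` one chunk at a time."""
    totals = pd.Series(dtype="int64")
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial = chunk.groupby(key)[value].sum()
        # fill_value=0 treats a key missing from either side as zero,
        # so keys first seen in later chunks are added correctly.
        totals = totals.add(partial, fill_value=0)
    # Aligned addition yields floats; the counts are whole numbers.
    return totals.astype("int64")


# Tiny in-memory stand-in for the real file (made-up data).
sample = io.StringIO(
    "Geography,AgeGroup,Gender,Race,Count\n"
    "County1,1,M,1,10\n"
    "County1,2,F,1,5\n"
    "County2,1,M,2,7\n"
    "County1,1,F,2,3\n"
)
totals = sum_counts_in_chunks(sample, chunksize=2)
```

    With a real file you would pass its path instead of the StringIO object. To sum across several categories at once, key can also be a list of column names, such as ["Geography", "AgeGroup"].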
    
