How to concatenate multiple pandas.DataFrames without running into MemoryError

盖世英雄少女心 2020-12-24 12:19

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I prevent this?

10 answers
  • 2020-12-24 13:13

    I advise you to concatenate your dataframes into a single CSV file, then read that CSV file back.

    Execute this:

    # write df1 content to file.csv
    df1.to_csv('file.csv', index=False)
    # append df2 content to file.csv, skipping the header row
    df2.to_csv('file.csv', mode='a', header=False, index=False)
    # append df3 content to file.csv, skipping the header row
    df3.to_csv('file.csv', mode='a', header=False, index=False)
    
    # free memory
    del df1, df2, df3
    
    # read all df1, df2, df3 contents
    df = pd.read_csv('file.csv')
    

    If this solution isn't performant enough, or you need to concatenate larger files than usual, do:

    df1.to_csv('file.csv', index=False)
    # write the pieces without headers, since they will be appended
    # to file.csv, which already has one
    df2.to_csv('file1.csv', index=False, header=False)
    df3.to_csv('file2.csv', index=False, header=False)
    
    del df1, df2, df3
    

    Then run the bash commands:

    cat file1.csv >> file.csv
    cat file2.csv >> file.csv
    

    Or concatenate the CSV files in Python, streaming with shutil so neither file is ever loaded fully into memory:

    import shutil
    
    def concat(file1, file2):
        # append file2's contents onto the end of file1 in chunks
        with open(file2, 'rb') as src, open(file1, 'ab') as dst:
            shutil.copyfileobj(src, dst)
    
    concat('file.csv', 'file1.csv')
    concat('file.csv', 'file2.csv')
    

    Finally, read everything back:

    df = pd.read_csv('file.csv')
    
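    If even the combined file is too large to read back at once, pandas can stream it in chunks (a sketch; process is a hypothetical placeholder for whatever per-chunk logic you need):

    # read the combined file 100,000 rows at a time instead of all at once
    for chunk in pd.read_csv('file.csv', chunksize=100_000):
        process(chunk)  # hypothetical per-chunk processing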
  • 2020-12-24 13:17

    Dask might be a good option to try for handling large dataframes; go through the Dask docs.
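    For example, a minimal sketch (assuming df1, df2, df3 are the in-memory frames from the question; npartitions=4 is an arbitrary choice):

    import dask.dataframe as dd
    
    # build lazy Dask frames from the pandas ones, concatenate them
    # lazily, and write the result out one partition at a time
    ddf = dd.concat([dd.from_pandas(df, npartitions=4) for df in (df1, df2, df3)])
    ddf.to_csv('combined-*.csv', index=False)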

  • 2020-12-24 13:17

    Another option:

    1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

    2) Open the .csv file in append mode, then append df2 without its header:

    with open('Big File.csv','a') as f:
        df2.to_csv(f, header=False)
    

    3) Repeat Step 2 with df3

    with open('Big File.csv','a') as f:
        df3.to_csv(f, header=False)
    
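    Afterwards, the combined file can be read back in one go (index_col=0 because step 1 wrote the DataFrame index as the first column):

    df = pd.read_csv('Big File.csv', index_col=0)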
  • 2020-12-24 13:18

    Kinda taking a guess here, but maybe:

    df1 = pd.concat([df1,df2])
    del df2
    df1 = pd.concat([df1,df3])
    del df3
    

    Obviously, you could do this more as a loop (see the sketch below), but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
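    A minimal sketch of that loop, assuming the frames are already in memory:

    frames = [df1, df2, df3]
    del df1, df2, df3  # drop the original names so only the list holds the frames
    result = frames.pop(0)
    while frames:
        # pop each frame out of the list so its memory can be
        # reclaimed as soon as it has been folded into the result
        result = pd.concat([result, frames.pop(0)])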

    More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

    concat_df = pd.DataFrame()
    for i in range(3):
        temp_df = pd.read_csv(f'foo{i}.csv')
        concat_df = pd.concat([concat_df, temp_df])
        del temp_df  # drop the piece before reading the next file
    

    In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.
