MemoryError on large merges with pandas in Python

日久生厌 2021-01-17 16:01

I'm using pandas to do an outer merge on a set of roughly 1,000-2,000 CSV files. Each CSV file has an identifier column, id, which is shared between all the files. When I try to merge them all together, I run into a MemoryError.
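For reference, the pattern that typically produces this is a repeated pairwise merge, presumably something along these lines (a minimal sketch, not code from the question; filenames is a hypothetical list of the CSV paths):

import functools

import pandas as pd

# filenames is a hypothetical list of the ~1000-2000 CSV paths
frames = [pd.read_csv(f) for f in filenames]

# repeatedly outer-merging on 'id' grows a large intermediate frame at every
# step, which is typically where the MemoryError is raised
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='id', how='outer'),
    frames,
)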

3 Answers
  • 2021-01-17 16:15

    pd.concat seems to run out of memory for large DataFrames as well. One option is to convert the DataFrames to NumPy arrays and concatenate those:

    from copy import deepcopy

    import numpy as np
    import pandas as pd

    def concat_df_by_np(df1, df2):
        """
        Accepts two DataFrames, converts each to a NumPy array, concatenates
        them horizontally and uses the index of the first DataFrame. This is
        not a concat by index but simply by position, therefore the index of
        both DataFrames should be the same.
        """
        if (df1.index != df2.index).any():
            # logging.warning('Indices in concat_df_by_np are not the same')
            print('Indices in concat_df_by_np are not the same')

        # .to_numpy() replaces the long-deprecated .as_matrix()
        dfout = deepcopy(pd.DataFrame(
            np.concatenate((df1.to_numpy(), df2.to_numpy()), axis=1),
            index=df1.index,
            columns=np.concatenate([df1.columns, df2.columns]),
        ))
        return dfout
    

    However, one needs to be careful: this function is not a join but a positional horizontal append in which the indices are ignored.
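
    A quick usage sketch (the toy frames below are made up; both must share the same index):

    df1 = pd.DataFrame({'a': [1, 2]}, index=[10, 20])
    df2 = pd.DataFrame({'b': [3, 4]}, index=[10, 20])
    wide = concat_df_by_np(df1, df2)  # columns 'a' and 'b' side by side, index taken from df1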

  • 2021-01-17 16:29

    I think you'll get better performance using a concat (which acts like an outer join):

    import pandas as pd

    # filenames: iterable of the CSV paths; every file has the shared 'id' column
    dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
    merged_df = pd.concat(dfs, axis=1)
    

    This means you are doing only one merge operation rather than one for each file.
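
    A small self-contained illustration of the outer-join behaviour (the two in-memory "files" below are made up for demonstration):

    import io

    import pandas as pd

    # two toy CSV "files" with partially overlapping ids
    csv_a = io.StringIO("id,x\n1,10\n2,20\n")
    csv_b = io.StringIO("id,y\n2,200\n3,300\n")

    dfs = (pd.read_csv(f).set_index('id') for f in (csv_a, csv_b))
    merged_df = pd.concat(dfs, axis=1)  # ids 1, 2 and 3 are all kept; gaps become NaN
    print(merged_df)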

  • 2021-01-17 16:35

    I hit the same error in 32-bit Python when using read_csv with a 1 GB file. Try the 64-bit version; hopefully that will solve the MemoryError problem.
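
    If you're not sure which build you're running, a quick standard-library check:

    import struct

    print(struct.calcsize('P') * 8)  # prints 32 or 64 (pointer size in bits)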
