I'm using pandas to do an outer merge on a set of about 1000-2000 CSV files. Each CSV file has an identifier column id which is shared between all the files.
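For reference, a minimal sketch of the kind of pairwise outer merge I'm describing (the glob pattern is just a placeholder for wherever the files live):

import glob
import pandas as pd

# hypothetical file pattern; each file has an 'id' column plus its own value columns
filenames = glob.glob('data/*.csv')

merged = pd.read_csv(filenames[0])
for filename in filenames[1:]:
    df = pd.read_csv(filename)
    # pairwise outer merge on the shared identifier column
    merged = merged.merge(df, on='id', how='outer')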
pd.concat seems to run out of memory for large dataframes as well; one option is to convert the DataFrames to NumPy arrays and concatenate those.
from copy import deepcopy

import numpy as np
import pandas as pd

def concat_df_by_np(df1, df2):
    """
    Accepts two dataframes, converts each to a NumPy array, concatenates them
    horizontally and uses the index of the first dataframe. This is not a concat
    by index but simply by position, therefore the index of both dataframes
    should be the same.
    """
    # .to_numpy() replaces .as_matrix(), which was removed in newer pandas versions
    dfout = deepcopy(pd.DataFrame(np.concatenate((df1.to_numpy(), df2.to_numpy()), axis=1),
                                  index=df1.index,
                                  columns=np.concatenate([df1.columns, df2.columns])))
    if (df1.index != df2.index).any():
        # logging.warning('Indices in concat_df_by_np are not the same')
        print('Indices in concat_df_by_np are not the same')
    return dfout
However, one needs to be careful: this function is not a join but rather a horizontal append in which the indices are ignored.
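A quick usage sketch (the DataFrames here are made up for illustration; both must share the same index and row order):

df1 = pd.DataFrame({'a': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'b': [3, 4]}, index=['x', 'y'])

combined = concat_df_by_np(df1, df2)
# combined has columns ['a', 'b'] and carries the index of df1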
I think you'll get better performance using a concat (which acts like an outer join):
dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)
This means you are doing only one merge operation rather than one for each file.
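A fuller sketch of the same approach, with an assumed glob pattern for gathering the filenames and the result written back out to disk:

import glob
import pandas as pd

# hypothetical location/pattern of the CSV files
filenames = glob.glob('data/*.csv')

dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)
merged_df.to_csv('merged.csv')  # write the combined result back out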
I met the same error in 32-bit Python when using read_csv with a 1 GB file. Try the 64-bit version; hopefully that will solve the MemoryError problem.
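To check which build you are running, a quick sketch:

import struct
import sys

# 64-bit Python prints 64 here; 32-bit builds print 32
print(struct.calcsize('P') * 8)
print(sys.maxsize > 2**32)  # True on a 64-bit build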