I have three DataFrames that I'm trying to concatenate.
concat_df = pd.concat([df1, df2, df3])
This results in a MemoryError. How can I resolve this?
I advise you to write your dataframes into a single CSV file by appending, then read that CSV file back.
Execute this:
# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)
# free memory
del df1, df2, df3
# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')
If this solution isn't performant enough, or you need to concatenate files larger than usual, do:
# write df1 (with its header) to file.csv
df1.to_csv('file.csv', index=False)
# write df2 and df3 to their own files, without headers, so they can be appended cleanly
df2.to_csv('file1.csv', index=False, header=False)
df3.to_csv('file2.csv', index=False, header=False)
del df1, df2, df3
Then run the bash commands:
cat file1.csv >> file.csv
cat file2.csv >> file.csv
(file1.csv and file2.csv were written without header rows above, so appending them does not duplicate the header.)
Or concatenate the CSV files in Python:
def concat(file1, file2):
    # read the contents of file2
    with open(file2, 'r') as f2:
        data = f2.read()
    # append them to file1
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')
Afterwards, read the combined file back:
df = pd.read_csv('file.csv')
Dask might be a good option to try for handling large dataframes - go through the Dask Docs.
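For example, a minimal sketch of the Dask approach, assuming each frame has first been written to its own CSV file (the file names here are hypothetical):
import dask.dataframe as dd

# dask reads the CSVs lazily, in partitions, so the full data never has to sit in memory at once
ddf = dd.read_csv(['df1.csv', 'df2.csv', 'df3.csv'])

# keep working on ddf lazily, or materialise it as a single pandas DataFrame
concat_df = ddf.compute()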
Another option:
1) Write df1 to a .csv file: df1.to_csv('Big File.csv')
2) Open the .csv file, then append df2:
with open('Big File.csv','a') as f:
df2.to_csv(f, header=False)
3) Repeat Step 2 with df3:
with open('Big File.csv','a') as f:
df3.to_csv(f, header=False)
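If needed, the combined file can then be read back the same way as in the other answers:
df = pd.read_csv('Big File.csv')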
Kinda taking a guess here, but maybe:
df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3
Obviously, you could do that as a loop instead, but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
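For instance, a sketch of that loop (moving the frames into a list first so the names df1, df2, df3 don't keep extra references alive; the list is just illustrative):
frames = [df1, df2, df3]
del df1, df2, df3  # the list now holds the only references

concat_df = frames.pop(0)
while frames:
    # pop() drops the list's reference, so each frame can be garbage-collected
    # once its rows have been copied into concat_df
    concat_df = pd.concat([concat_df, frames.pop(0)])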
More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):
concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo' + str(i) + '.csv')
    concat_df = pd.concat([concat_df, temp_df])
In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.