I have three DataFrames that I'm trying to concatenate.
concat_df = pd.concat([df1, df2, df3])
This results in a MemoryError. How can I re
Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:
df1.to_csv(filename)
df2.to_csv(filename, mode='a', header=False)
df3.to_csv(filename, mode='a', header=False)
del df1, df2, df3
df_concat = pd.read_csv(filename)
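A self-contained sketch of the same idea, assuming small stand-in frames and a temporary file. The helper name `concat_via_csv` is made up for illustration. It writes the header only once and skips the index so that it is not read back as an extra column.

```python
import os
import tempfile

import pandas as pd


def concat_via_csv(frames):
    """Concatenate DataFrames by appending them to one CSV file on disk."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)  # pandas reopens the file by path
    try:
        for i, frame in enumerate(frames):
            first = i == 0
            # First frame creates the file with a header; the rest are appended.
            frame.to_csv(path, mode="w" if first else "a", header=first, index=False)
        return pd.read_csv(path)
    finally:
        os.remove(path)


# Stand-in data (assumption: all frames share the same columns).
df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pd.DataFrame({"a": [5], "b": [6.0]})
df3 = pd.DataFrame({"a": [7, 8], "b": [9.0, 10.0]})
result = concat_via_csv([df1, df2, df3])
```

Note that the helper alone does not save memory: the originals are still referenced while the file is read back. To get the benefit you would drop them first, as the `del df1, df2, df3` line above does.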
I've had similar performance issues while trying to concatenate a large number of DataFrames to a 'growing' DataFrame.
My workaround was to append all the sub DataFrames to a list, and then concatenate the list once processing of the sub DataFrames had completed. In my case this cut the runtime almost in half.
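A minimal sketch of that pattern, assuming a hypothetical `make_chunk` function that stands in for whatever produces each sub DataFrame:

```python
import numpy as np
import pandas as pd


def make_chunk(i, rows=1000):
    # Hypothetical stand-in for the real per-chunk processing.
    return pd.DataFrame({"chunk": i, "value": np.arange(rows)})


# Collect the pieces in a plain list; appending to a list does not copy data.
parts = []
for i in range(50):
    parts.append(make_chunk(i))

# Concatenate once at the end instead of growing a DataFrame in the loop,
# which would copy all accumulated rows on every iteration.
combined = pd.concat(parts, ignore_index=True)
```

This helps with runtime, but peak memory still has to hold both the list of parts and the final result, so it may not cure a MemoryError on its own.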
The problem, as noted in the other answers, is one of memory. A solution is to store the data on disk, then build a single dataframe from it.
With such huge data, performance is an issue.
CSV solutions are very slow, since the data is converted to and from text. HDF5 solutions are shorter, more elegant and faster, since they use a binary format. I propose a third way, also binary, with pickle, which seems to be even faster but is more technical and needs some more room. And a fourth, by hand.
Here is the code:
import os
import pickle
import numpy as np
import pandas as pd
# a DataFrame factory:
dfs=[]
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5,4)),columns=range(4)))
# a csv solution
def bycsv(dfs):
    md,hd='w',True
    for df in dfs:
        df.to_csv('df_all.csv',mode=md,header=hd,index=False)
        md,hd='a',False
    #del dfs
    df_all=pd.read_csv('df_all.csv',index_col=None)
    os.remove('df_all.csv')
    return df_all
Better solutions:
def byHDF(dfs):
    store=pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df',df,data_columns=list('0123'))
    #del dfs
    df=store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df
def bypickle(dfs):
    c=[]
    with open('df_all.pkl','ab') as f:
        for df in dfs:
            pickle.dump(df,f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl','rb') as f:
        df_all=pickle.load(f)
        offset=len(df_all)
        # preallocate the remaining rows (DataFrame.append was removed in pandas 2.0)
        rest=pd.DataFrame(np.empty(sum(c[1:])*4).reshape(-1,4))
        df_all=pd.concat([df_all,rest],ignore_index=True)
        for size in c[1:]:
            df=pickle.load(f)
            df_all.iloc[offset:offset+size]=df.values
            offset+=size
    os.remove('df_all.pkl')
    return df_all
For homogeneous dataframes, we can do even better:
def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all
And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.
In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s
In [93]: %time x=byHDF(dfs)
Wall time: 547 ms
In [94]: %time v=bypickle(dfs)
Wall time: 219 ms
In [95]: %time y=byhand(dfs)
Wall time: 109 ms
A check:
In [195]: (x.values==w.values).all()
Out[195]: True
In [196]: (x.values==v.values).all()
Out[196]: True
In [197]: (x.values==y.values).all()
Out[197]: True
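For reference, the 32 MB figure used in these tests follows from the shape of the test frames. A quick sketch of the arithmetic, assuming float64 data as produced by np.empty in the factory above (the variable names here are only illustrative):

```python
import numpy as np
import pandas as pd

n_frames, n_rows, n_cols = 10, 10**5, 4
frame = pd.DataFrame(np.empty((n_rows, n_cols)), columns=range(n_cols))

# Each float64 value takes 8 bytes: rows * cols * 8 per frame.
bytes_per_frame = frame.values.nbytes
total_bytes = n_frames * bytes_per_frame
total_mb = total_bytes / 1e6

# Scale factor from the test data to 4 GB (taking 1 GB = 10**9 bytes).
scale_to_4gb = 4e9 / total_bytes
```

With decimal units the factor comes out at 125; with binary units (4 * 2**30 bytes) it is a little over 134, so "about 128" is the right order of magnitude either way. Timings will not necessarily scale linearly once the data stops fitting in caches or RAM.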
Of course, all of this must be adapted and tuned to fit your problem.
For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' so that bypickle can run.
I can edit this answer if you give more information on your data structure and size. Beautiful question!
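The chunking idea mentioned above could look something like this. It is only a sketch: `iter_chunks` and `max_rows` are hypothetical names, and in practice you would derive the row budget from the memory you have left.

```python
import numpy as np
import pandas as pd


def iter_chunks(df, max_rows):
    """Yield consecutive row slices of df with at most max_rows rows each."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start in range(0, len(df), max_rows):
        yield df.iloc[start:start + max_rows]


# Small stand-in for a frame that is too big to handle in one piece.
df3 = pd.DataFrame(np.arange(40).reshape(10, 4), columns=range(4))

chunks = list(iter_chunks(df3, max_rows=4))
sizes = [len(chunk) for chunk in chunks]

# Putting the chunks back together gives the original frame.
rebuilt = pd.concat(chunks)
```

Each chunk could then be passed through pickle.dump one at a time, in the same way bypickle handles a list of frames.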
You can store your individual dataframes in an HDF store, and then query the store just like one big dataframe.
# name of store
fname = 'my_store'
# pd.get_store has been removed from pandas; pd.HDFStore works as a context manager
with pd.HDFStore(fname) as store:
    # save individual dfs to store
    for df in [df1, df2, df3, df_foo]:
        store.append('df',df,data_columns=['FOO','BAR','ETC']) # data_columns = identify the column in the dfs you are appending
    # access the store as a single df
    df = store.select('df', where = ['A>2']) # change where condition as required (see documentation for examples)
    # Do other stuff with df #
# the with block closes the store; delete the file when you're done
os.remove(fname)
When writing to disk, df.to_csv throws an error for columns=False; use header=False instead.
The solution below works fine:
# write df1 to hard disk as file.csv
train1.to_csv('file.csv', index=False)
# append df2 to file.csv
train2.to_csv('file.csv', mode='a', header=False, index=False)
# read the appended csv as df
train = pd.read_csv('file.csv')
I'm grateful to the community for their answers. However, in my case, I found out that the problem was actually due to the fact that I was using 32-bit Python.
Windows defines different memory limits for 32-bit and 64-bit processes. For a 32-bit process, the limit is only 2 GB. So even if you have more than 2 GB of RAM, and even if you're running a 64-bit OS, a 32-bit process will be limited to just 2 GB of RAM. In my case that process was Python.
I upgraded to 64-bit Python, and haven't had a memory error since then!
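If you're not sure which build you are running, a quick check using only the standard library:

```python
import platform
import struct
import sys

# Size of a C pointer in the running interpreter, in bits (32 or 64).
pointer_bits = struct.calcsize("P") * 8

# Equivalent check: sys.maxsize only exceeds 2**32 on a 64-bit build.
is_64bit = sys.maxsize > 2**32

print(f"{platform.python_implementation()} {platform.python_version()}: {pointer_bits}-bit")
```

This reports the bitness of the Python process itself, which is what matters here, not the bitness of the operating system.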
Other relevant questions are: Python 32-bit memory limits on 64bit windows, Should I use Python 32bit or Python 64bit, Why is this numpy array too big to load?