I am trying to speed up the sum for several big multilevel dataframes.
Here is a sample:
df1 = mul_df(5000, 30, 400)  # mul_df creates a big multilevel dataframe (defined below)
In the following, my observations:
- First, I reproduce your test case and come to different results. Using numexpr under the hood of pandas increases performance significantly.
- Second, I sort one of the four DataFrames in descending order and rerun all cases. Performance breaks down, and additionally (as expected) direct numexpr evaluation on pandas DataFrames leads to wrong results.
This first case reproduces yours. The only difference is that I create copies of the initial DataFrame instance, so nothing is shared: there are distinct objects (ids) in use, to make sure numexpr can deal with it.
import itertools
import numpy as np
import pandas as pd
def mul_df(level1_rownum, level2_rownum, col_num, data_ty='float32'):
''' create multilevel dataframe, for example: mul_df(4,2,6)'''
index_name = ['STK_ID','RPT_Date']
col_name = ['COL'+str(x).zfill(3) for x in range(col_num)]
first_level_dt = [['A'+str(x).zfill(4)]*level2_rownum for x in range(level1_rownum)]
first_level_dt = list(itertools.chain(*first_level_dt)) #flatten the list
second_level_dt = ['B'+str(x).zfill(3) for x in range(level2_rownum)]*level1_rownum
dt = pd.DataFrame(np.random.randn(level1_rownum*level2_rownum, col_num), columns=col_name, dtype = data_ty)
dt[index_name[0]] = first_level_dt
dt[index_name[1]] = second_level_dt
rst = dt.set_index(index_name, drop=True, inplace=False)
return rst
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy()
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
564 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
152 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
66.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
True
Here I sort one of the DataFrames in descending order, which changes the index and reshuffles the rows in the DataFrame's internal numpy array.
The imports and the mul_df definition are the same as above.
df1 = mul_df(5000,30,400)
df2, df3, df4 = df1.copy(), df1.copy(), df1.copy().sort_index(ascending=False)
pd.options.compute.use_numexpr = False
%%timeit
df1 + df2 + df3 + df4
1.36 s ± 67.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pd.options.compute.use_numexpr = True
%%timeit
df1 + df2 + df3 + df4
928 ms ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import numexpr as ne
%%timeit
pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32')
68 ms ± 2.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
(df1 + df2 + df3 + df4).equals(pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4'), columns=df1.columns, index=df1.index, dtype='float32'))
False
Conclusion: pandas already hands simple arithmetic such as 2 * df1 to numexpr under the hood, but the big win comes from evaluating the whole sum in one direct numexpr call. The catch: numexpr sees only the raw arrays and knows nothing about the index, which is why the sorted case above returns a wrong result. A fix is sketched below.
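To illustrate that fix (my addition, not part of the original answer): realign the sorted frame to df1's index before the direct call, and the results match again. A minimal sketch reusing df1..df4 from above:

# Hedged sketch: reindex restores df1's row order before numexpr sees the raw arrays.
df4_aligned = df4.reindex(df1.index)
fast = pd.DataFrame(ne.evaluate('df1 + df2 + df3 + df4_aligned'),
                    columns=df1.columns, index=df1.index, dtype='float32')
# Sort both results so the index order matches before comparing.
print((df1 + df2 + df3 + df4).sort_index().equals(fast.sort_index()))  # expected: True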
method 1: On my machine not so bad (with numexpr disabled)
In [41]: from pandas.core import expressions as expr
In [42]: expr.set_use_numexpr(False)
In [43]: %timeit df1+df2+df3+df4
1 loops, best of 3: 349 ms per loop
method 2: Using numexpr (which is enabled by default if numexpr is installed)
In [44]: expr.set_use_numexpr(True)
In [45]: %timeit df1+df2+df3+df4
10 loops, best of 3: 173 ms per loop
method 3: Direct use of numexpr
In [34]: import numexpr as ne
In [46]: %timeit DataFrame(ne.evaluate('df1+df2+df3+df4'),columns=df1.columns,index=df1.index,dtype='float32')
10 loops, best of 3: 47.7 ms per loop
These speedups are achieved using numexpr because:
- it avoids intermediate temporary arrays: plain numpy evaluates the sum pairwise, roughly as ((df1+df2)+df3)+df4, allocating a full temporary at each step (see the sketch after this list)
- it uses multiple cores where available
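The temporaries point can be seen with plain numpy (my illustration; the array names are hypothetical): each pairwise + allocates a fresh full-size array, while in-place accumulation, like numexpr's fused evaluation, reuses one buffer.

import numpy as np

# Same shape as the frames above: 5000 * 30 rows, 400 float32 columns.
a, b, c, d = (np.random.randn(150000, 400).astype('float32') for _ in range(4))

out = ((a + b) + c) + d        # pairwise: two hidden temporaries plus the result

acc = a.copy()                 # single buffer, updated in place
for arr in (b, c, d):
    np.add(acc, arr, out=acc)  # no intermediate allocations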
As I hinted above, pandas uses numexpr under the hood for certain types of ops (in 0.11), e.g. df1 + df2 would be evaluated this way. However, the example you are giving here results in several separate calls to numexpr (this is why method 2 is faster than method 1). Using the direct ne.evaluate(...) of method 3 achieves even more speedup.
Note that in pandas 0.13 (0.12 will be released this week), we are implementing a function pd.eval which will in effect do exactly what my example above does. Stay tuned (if you are adventurous, this will be in master somewhat soon: https://github.com/pydata/pandas/pull/4037)
In [5]: %timeit pd.eval('df1+df2+df3+df4')
10 loops, best of 3: 50.9 ms per loop
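For reference, once pd.eval is available the whole thing collapses to a single call, with index alignment handled for you:

import pandas as pd
# engine='numexpr' is the default when numexpr is installed
result = pd.eval('df1 + df2 + df3 + df4', engine='numexpr')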
Lastly, to answer your question: cython will not help here at all; numexpr is quite efficient at this type of problem (that said, there are situations where cython is helpful).
One caveat: in order to use the direct numexpr method, the frames must already be aligned (numexpr operates on the underlying numpy arrays and knows nothing about the indices). They should also be of a single dtype.
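To make that caveat concrete, here is a hedged sketch (fast_sum is a hypothetical helper of mine, not a pandas or numexpr API): verify alignment and a single dtype first, and fall back to plain pandas arithmetic otherwise.

import numexpr as ne
import pandas as pd

def fast_sum(*frames):
    '''Hypothetical helper: sum DataFrames via one direct numexpr call when safe.'''
    base = frames[0]
    aligned = all(f.index.equals(base.index) and f.columns.equals(base.columns)
                  for f in frames[1:])
    single_dtype = len({dt for f in frames for dt in f.dtypes}) == 1
    if not (aligned and single_dtype):
        return sum(frames[1:], base)  # fall back: pandas handles the alignment
    arrays = {'a%d' % i: f.values for i, f in enumerate(frames)}
    return pd.DataFrame(ne.evaluate(' + '.join(arrays), local_dict=arrays),
                        index=base.index, columns=base.columns)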