I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but whe
No, I don't think you should give up on pandas. There are definitely better ways to do what you're trying to do. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. When you pass them a Python function, they're basically implemented as for loops with extra overhead on top, so a plain python for loop is often just as fast, if not faster.
The real speed gain comes when you get rid of the loops and use pandas functions that implicitly vectorise their operations. For example, your first line of code can be simplified greatly, as I show below.
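To make the contrast concrete, here is a minimal sketch on a toy frame. The names toy, slow and fast are made up for illustration and are not part of your code. Both expressions compute the same column difference, but the first calls a Python lambda once per row while the second is a single operation over whole columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": rng.normal(size=1_000), "b": rng.normal(size=1_000)})

# Row-wise apply: the lambda is called once per row, in the Python interpreter.
slow = toy.apply(lambda row: row["a"] - row["b"], axis=1)

# Vectorised: one subtraction over the whole columns, done in compiled code.
fast = toy["a"] - toy["b"]

# Both approaches should agree on the result.
same = np.allclose(slow, fast)
print(same)
```

On most machines the second form should be considerably faster, but measure it on your own data before relying on that.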
In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.
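import numpy as np
import pandas as pd

# 1000 rows: a group key 'pk' with 10 distinct values, plus 100 value columns
data = {'pk' : np.random.choice(10, 1000)}
data.update({'Val{}'.format(i) : np.random.randn(1000) for i in range(100)})
df = pd.DataFrame(data)
g = df.groupby('pk')
c = ['Val{}'.format(i) for i in range(100)]  # names of the value columns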
transform + sub + shift → diff
Your first line of code can be replaced with a simple diff statement:
v1 = df.groupby('pk')[c].diff().fillna(0)
Sanity Check
v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
np.allclose(v1, v2)
True
Performance
%timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
10 loops, best of 3: 44.3 ms per loop
%timeit df.groupby('pk')[c].diff().fillna(0)
100 loops, best of 3: 9.63 ms per loop
As far as your second line of code is concerned, I don't see too much room for improvement, although you can get rid of the reset_index() + [val_cols] call if your groupby statement is not considering pk as the index:
g = df.groupby('pk', as_index=False)
Your second line of code then reduces to:
v3 = g[c].rolling(4).mean().shift(1)
Sanity Check
g2 = df.groupby('pk')
v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]
np.allclose(v3.fillna(0), v4.fillna(0))
True
Performance
%timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
10 loops, best of 3: 46.5 ms per loop
%timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)
10 loops, best of 3: 41.7 ms per loop
Note that timings vary between machines, so test your code thoroughly to confirm that there is indeed an improvement on your data.
While the difference this time isn't as large, it does show that there are improvements you can make. The impact could be much greater on larger data.
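If you are not working in IPython, where %timeit is available, a rough way to repeat the first comparison on your own machine is the standard timeit module. This is only a sketch: the helpers make_frame, with_transform and with_diff are names invented here, and the frame shape simply mirrors the setup above.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=1000, n_groups=10, n_cols=100, seed=0):
    """Build a random frame shaped like the one in the setup (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = {"pk": rng.integers(0, n_groups, n_rows)}
    data.update({f"Val{i}": rng.standard_normal(n_rows) for i in range(n_cols)})
    return pd.DataFrame(data)


def with_transform(df, cols):
    # Original approach: a Python lambda run for every group.
    return df.groupby("pk")[cols].transform(lambda x: x - x.shift(1)).fillna(0)


def with_diff(df, cols):
    # Replacement: the built-in groupby diff.
    return df.groupby("pk")[cols].diff().fillna(0)


bench_df = make_frame()
bench_cols = [f"Val{i}" for i in range(100)]

# Check correctness first, then time each variant (best of 3 repeats, 3 calls each).
results_match = np.allclose(
    with_transform(bench_df, bench_cols), with_diff(bench_df, bench_cols)
)
timings = {
    fn.__name__: min(
        timeit.repeat(lambda: fn(bench_df, bench_cols), number=3, repeat=3)
    ) / 3
    for fn in (with_transform, with_diff)
}

print("results match:", results_match)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms per call")
```

Swap in your real DataFrame for bench_df to see whether the gain holds at your data size.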
In conclusion, most slow operations are slow because they are not vectorised, which means they can be sped up. The key is to get rid of any approach that does not use vectorisation.
To this end, it is sometimes beneficial to step out of pandas space and step into numpy space. Operations on numpy arrays or using numpy tend to be much faster than pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, and np.where is faster than pd.DataFrame.where, and so on).
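As a rough illustration of stepping into numpy space, the sketch below computes the same two results both ways on a made-up frame (frame and the other variable names are assumptions, not taken from your code). It only demonstrates that the results agree; how much faster the numpy version is will depend on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame, only for illustration.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))

# pandas space: keep positive values, replace the rest with 0; sum each column.
pd_clipped = frame.where(frame > 0, 0.0)
pd_totals = frame.sum()

# numpy space: pull out the underlying array once, then work on it directly.
values = frame.to_numpy()
np_clipped = np.where(values > 0, values, 0.0)
np_totals = values.sum(axis=0)

# Wrap the array back into a DataFrame if you need the labels again.
np_clipped_df = pd.DataFrame(np_clipped, index=frame.index, columns=frame.columns)

print(np.allclose(pd_clipped.to_numpy(), np_clipped))
print(np.allclose(pd_totals.to_numpy(), np_totals))
```

Keep in mind that numpy gives up what pandas does for you: there is no index alignment, and np.sum does not skip NaN values the way DataFrame.sum does by default.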
Sometimes, loops cannot be avoided. In that case, you can write a basic looping function and then compile it with numba or cython. There are examples of that at Enhancing Performance, straight from the horse's mouth.
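For a flavour of what that looks like with numba, here is a minimal sketch. The function running_decay is a made-up example of a loop whose every step depends on the previous one, which makes it awkward to vectorise. The try/except fallback is only there so the snippet still runs, uncompiled, if numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # compiles the decorated function to machine code
except ImportError:
    # Assumption for this sketch: fall back to plain Python when numba is missing.
    def njit(func):
        return func


@njit
def running_decay(values, decay):
    """Hypothetical recurrence: out[i] = decay * out[i - 1] + values[i]."""
    out = np.empty_like(values)
    acc = 0.0
    for i in range(values.shape[0]):
        acc = decay * acc + values[i]
        out[i] = acc
    return out


x = np.arange(5, dtype=np.float64)
y = running_decay(x, 0.5)
print(y)
```

To use it on a DataFrame column, pass the underlying array, for example running_decay(df['Val0'].to_numpy(), 0.5). The first call includes compilation time, so time the second call onwards.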
In still other cases, your data is just too big to reasonably fit into numpy arrays. In that case, it would be time to give up and switch to dask or spark, both of which are high-performance distributed computing frameworks for working with big data.