Improving the performance of pandas groupby

时光取名叫无心 2021-02-01 03:51

I have a machine learning application written in Python which includes a data processing step. When I wrote it, I initially did the data processing on Pandas DataFrames, but whe

1 answer
  • 2021-02-01 04:48

    No, I don't think you should give up on pandas. There are definitely better ways to do what you're trying to do. The trick is to avoid apply/transform in any form as much as possible. Avoid them like the plague. They're basically implemented as Python for loops over the groups, so they run at interpreter speed rather than C speed and throw away most of the performance pandas can give you.

    The real speed gain comes when you get rid of the loops and use pandas functions that implicitly vectorise their operations. For example, your first line of code can be simplified greatly, as I show below.
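
    As a small illustration of the difference (not taken from the question: the frame `demo` and its columns are made up for this sketch), the same per-group running total can be written with a Python lambda or with the built-in groupby method. Both give the same numbers, but only the second stays in compiled code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "pk": rng.integers(0, 3, size=12),
    "val": rng.standard_normal(12),
})

# Slow pattern: the lambda is called from Python once per group.
slow = demo.groupby("pk")["val"].transform(lambda s: s.cumsum())

# Fast pattern: the built-in groupby method does the work in compiled code.
fast = demo.groupby("pk")["val"].cumsum()

# Both results are aligned to the original row order.
same_result = np.allclose(slow, fast)
print(same_result)
```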

    In this post I outline the setup process, and then, for each line in your question, offer an improvement, along with a side-by-side comparison of the timings and correctness.

    Setup

    import numpy as np
    import pandas as pd

    data = {'pk': np.random.choice(10, 1000)}
    data.update({'Val{}'.format(i) : np.random.randn(1000) for i in range(100)})
    
    df = pd.DataFrame(data)
    
    g = df.groupby('pk')
    c = ['Val{}'.format(i) for i in range(100)]
    

    transform + sub + shift -> diff

    Your first line of code can be replaced with a simple diff statement:

    v1 = df.groupby('pk')[c].diff().fillna(0)
    

    Sanity Check

    v2 = df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
    
    np.allclose(v1, v2)
    True
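
    To see by eye why the two expressions agree, here is a hand-checkable sketch on a made-up five-row frame (`toy` is hypothetical, not the question's data). The difference restarts at the first row of each group, which is where the NaN, and then the 0 from fillna, comes from:

```python
import pandas as pd

# Hypothetical toy data: group "a" is rows 0, 1, 3; group "b" is rows 2, 4.
toy = pd.DataFrame({
    "pk":  ["a", "a", "b", "a", "b"],
    "val": [1.0, 4.0, 10.0, 9.0, 13.0],
})

# Built-in per-group difference with the previous row of the same group.
via_diff = toy.groupby("pk")["val"].diff().fillna(0)

# The transform + shift formulation it replaces.
via_shift = toy.groupby("pk")["val"].transform(lambda x: x - x.shift(1)).fillna(0)

print(via_diff.tolist())   # [0.0, 3.0, 0.0, 5.0, 3.0]
print(via_shift.tolist())  # [0.0, 3.0, 0.0, 5.0, 3.0]
```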
    

    Performance

    %timeit df.groupby('pk')[c].transform(lambda x: x - x.shift(1)).fillna(0)
    10 loops, best of 3: 44.3 ms per loop
    
    %timeit df.groupby('pk')[c].diff().fillna(0)
    100 loops, best of 3: 9.63 ms per loop
    

    Removing redundant indexing operations

    As far as your second line of code is concerned, I don't see too much room for improvement, although you can get rid of the reset_index() + [val_cols] call if you tell groupby not to use pk as the index:

    g = df.groupby('pk', as_index=False)
    

    Your second line of code then reduces to:

    v3 = g[c].rolling(4).mean().shift(1)
    

    Sanity Check

    g2 = df.groupby('pk')
    v4 = g2[c].rolling(4).mean().shift(1).reset_index()[c]
    
    np.allclose(v3.fillna(0), v4.fillna(0))
    True
    

    Performance

    %timeit df.groupby('pk')[c].rolling(4).mean().shift(1).reset_index()[c]
    10 loops, best of 3: 46.5 ms per loop
    
    %timeit df.groupby('pk', as_index=False)[c].rolling(4).mean().shift(1)
    10 loops, best of 3: 41.7 ms per loop
    

    Note that timings vary between machines, so benchmark on your own data to confirm that there is indeed an improvement.

    While the difference this time is smaller, it shows that there are still improvements you can make! The impact could be much greater on larger data.
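
    Outside IPython, where %timeit is not available, a comparison like the ones above can be sketched with the standard-library timeit module. The helper `best_ms` and the smaller 20-column frame are assumptions made for this sketch, and the numbers it prints will differ from machine to machine:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data, shaped like the setup above but smaller.
rng = np.random.default_rng(0)
cols = ["Val{}".format(i) for i in range(20)]
data = {"pk": rng.integers(0, 10, size=1000)}
data.update({name: rng.standard_normal(1000) for name in cols})
frame = pd.DataFrame(data)


def with_transform():
    return frame.groupby("pk")[cols].transform(lambda x: x - x.shift(1)).fillna(0)


def with_diff():
    return frame.groupby("pk")[cols].diff().fillna(0)


def best_ms(func, repeat=3, number=5):
    """Best average time per call, in milliseconds (illustrative helper)."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number * 1000


timings = {f.__name__: best_ms(f) for f in (with_transform, with_diff)}
for name, ms in timings.items():
    print("{}: {:.2f} ms per call".format(name, ms))
```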


    Afterword

    In conclusion, most slow operations can be sped up. The key is to get rid of any approach that does not use vectorization.

    To this end, it is sometimes beneficial to step out of pandas space and step into numpy space. Operations on numpy arrays or using numpy tend to be much faster than pandas equivalents (for example, np.sum is faster than pd.DataFrame.sum, and np.where is faster than pd.DataFrame.where, and so on).
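
    A minimal sketch of what stepping into numpy space looks like, assuming an all-numeric frame with no missing values (the frame here is made up). The results match the pandas versions; whether it is actually faster on your data is something to measure rather than assume:

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame without NaNs.
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((1000, 5)), columns=list("abcde"))

# Pull out the underlying 2-D array once, then work on it directly.
values = frame.to_numpy()

# Column sums: numpy on the raw array vs. the pandas method.
col_sums_np = values.sum(axis=0)
col_sums_pd = frame.sum()

# Replace non-positive entries with 0: np.where vs. DataFrame.where.
clipped_np = np.where(values > 0, values, 0.0)
clipped_pd = frame.where(frame > 0, 0.0)

print(np.allclose(col_sums_np, col_sums_pd.to_numpy()))
print(np.allclose(clipped_np, clipped_pd.to_numpy()))
```

    Keep in mind that the two are not always interchangeable: DataFrame.sum skips NaN by default, while the plain numpy sum propagates it.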

    Sometimes, loops cannot be avoided. In that case, you can write a basic looping function and then speed it up with numba or cython. There are examples of this in the Enhancing Performance section of the pandas documentation, straight from the horse's mouth.
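
    As a rough sketch of that idea, here is a loop whose state depends on the previous step, so it does not vectorise easily: a running total that is reset to zero whenever it would go negative. The function `clipped_cumsum` is a made-up example, not something from the question. It is compiled with numba's njit if numba is installed and otherwise runs as an ordinary Python loop:

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback when numba is missing: leave the function uncompiled.
        return func


@njit
def clipped_cumsum(values):
    """Running total over a 1-D float array that never drops below zero."""
    out = np.empty_like(values)
    total = 0.0
    for i in range(values.shape[0]):
        total = max(total + values[i], 0.0)
        out[i] = total
    return out


result = clipped_cumsum(np.array([1.0, -3.0, 2.0, 0.5]))
print(result.tolist())  # [1.0, 0.0, 2.0, 2.5]
```

    The function takes a plain numpy array, so with a DataFrame you would pass something like a column's to_numpy() result rather than the Series itself.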

    In still other cases, your data is just too big to reasonably fit into numpy arrays. In that case, it is time to give up and switch to dask or spark, both of which offer high-performance distributed computing frameworks for working with big data.
