How to speed up pandas with cython (or numpy)

一个人的身影 2020-12-28 08:16

I am trying to use Cython to speed up a Pandas DataFrame computation which is relatively simple: iterating over each row in the DataFrame, add that row to itself and to all

1 answer
  • 2020-12-28 08:46

    If you just want it to run faster and aren't tied to Cython specifically, I'd do it in plain NumPy (about 50x faster in the timings below).

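    def numpy_foo(arr):
        # For each row i, add row i to itself and to every later row (broadcasting),
        # then sum across columns; the result maps i to the list of those sums.
        vals = {i: (arr[i, :] + arr[i:, :]).sum(axis=1).tolist()
                for i in range(arr.shape[0])}
        return vals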
    
    %timeit foo(df)
    100 loops, best of 3: 7.2 ms per loop
    
    %timeit numpy_foo(df.values)
    10000 loops, best of 3: 144 µs per loop
    
    foo(df) == numpy_foo(df.values)
    Out[586]: True
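
    The question's `foo` is not shown here, so the check above can't be re-run as written. What follows is only a minimal sketch, assuming `foo` adds each row to itself and to every later row and then sums across columns (inferred from `numpy_foo`). The name `pandas_foo` and the sample frame are hypothetical, and `df.to_numpy()` is used in place of `df.values`.

```python
import numpy as np
import pandas as pd


def numpy_foo(arr):
    # Same logic as the answer's function.
    return {i: (arr[i, :] + arr[i:, :]).sum(axis=1).tolist()
            for i in range(arr.shape[0])}


def pandas_foo(df):
    # Hypothetical stand-in for the question's `foo` (not shown in the source).
    vals = {}
    for i in range(len(df)):
        row = df.iloc[i]
        # Add row i to itself and to every later row, then sum across columns.
        vals[i] = df.iloc[i:].add(row, axis=1).sum(axis=1).tolist()
    return vals


# Small integer frame so the two results compare exactly.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
same = pandas_foo(df) == numpy_foo(df.to_numpy())
print(same)
```

    Integer data keeps the comparison exact. With floats the two versions may differ in the last bits, so a tolerance-based check such as `np.allclose` would probably be safer.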
    

    Generally speaking, pandas gives you a lot of conveniences relative to NumPy, but they come with overhead. When pandas isn't really adding anything, you can usually speed things up by working on the underlying NumPy array instead. For another example, see this question I asked, which showed a roughly comparable speedup (about 23x).
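
    To see that overhead on your own machine, something like the following sketch may help. It times the same per-row sum through `df.iloc` and through the raw array. The function names, frame size, and repeat count are all arbitrary choices for illustration, and the ratio you get will depend on your pandas and NumPy versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(200, 4)), columns=list("abcd"))
arr = df.to_numpy()


def row_sums_pandas(df):
    # Each df.iloc[i] builds a new Series object before summing.
    return [int(df.iloc[i].sum()) for i in range(len(df))]


def row_sums_numpy(arr):
    # Indexing an ndarray row skips the Series construction.
    return [int(arr[i].sum()) for i in range(arr.shape[0])]


# Both versions must agree before the timings mean anything.
assert row_sums_pandas(df) == row_sums_numpy(arr)

t_pandas = timeit.timeit(lambda: row_sums_pandas(df), number=20)
t_numpy = timeit.timeit(lambda: row_sums_numpy(arr), number=20)
print(f"pandas: {t_pandas:.4f}s  numpy: {t_numpy:.4f}s")
```

    Neither loop is how you would write this in practice (`arr.sum(axis=1)` does it in one call). The point is only to isolate the per-row access cost.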
