Efficient pandas/numpy function for time since change

迷失自我 2021-01-21 04:59

Given a Series, I would like to efficiently compute how many observations have passed since the value last changed. Here is a simple example:

ser = pd.         


        
2 Answers
  • 2021-01-21 05:32

    Here's one NumPy approach -

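    import numpy as np

    def array_cumcount(a):
        # Positions where the value differs from the previous element (run starts).
        idx = np.flatnonzero(a[1:] != a[:-1]) + 1
        # Every step adds 1; run starts get an offset that resets the running sum to 0.
        shift_arr = np.ones(a.size, dtype=int)
        if a.size == 0:
            return shift_arr
        shift_arr[0] = 0

        if len(idx) >= 1:
            shift_arr[idx[0]] = -idx[0] + 1
            shift_arr[idx[1:]] = -idx[1:] + idx[:-1] + 1
        return shift_arr.cumsum()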
    

    Sample run -

    In [583]: ser = pd.Series([1.2,1.2,1.2,1.2,2,2,2,4,3,3,3,3])
    
    In [584]: array_cumcount(ser.values)
    Out[584]: array([0, 1, 2, 3, 0, 1, 2, 0, 0, 1, 2, 3])
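
To see why the cumulative sum resets at each change, it may help to walk through the intermediate arrays for the sample input. This is only an illustrative sketch: the names `run_lengths` and `steps` are introduced here and do not appear in the function above, but the arithmetic is intended to be equivalent.

```python
import numpy as np

a = np.array([1.2, 1.2, 1.2, 1.2, 2, 2, 2, 4, 3, 3, 3, 3])

# Run starts: indices where the value differs from its predecessor.
idx = np.flatnonzero(a[1:] != a[:-1]) + 1      # [4, 7, 8]

# Length of the run that ends just before each run start.
run_lengths = np.diff(np.r_[0, idx])           # [4, 3, 1]

# Step array: +1 everywhere, 0 at position 0, and 1 - run_length at each
# run start, so the running total drops back to 0 there.
steps = np.ones(a.size, dtype=int)
steps[0] = 0
steps[idx] = 1 - run_lengths                   # -3, -2, 0 at positions 4, 7, 8

counts = steps.cumsum()
print(counts)  # [0 1 2 3 0 1 2 0 0 1 2 3]
```

Just before a run start the running total equals the previous run's length minus one, so adding `1 - run_length` at that position brings it back to zero.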
    

    Runtime test -

    In [601]: ser = pd.Series(np.random.randint(0,3,(10000)))
    
    # @Psidom's soln
    In [602]: %timeit ser.groupby(ser).cumcount()
    1000 loops, best of 3: 729 µs per loop
    
    In [603]: %timeit array_cumcount(ser.values)
    10000 loops, best of 3: 85.3 µs per loop
    
    In [604]: ser = pd.Series(np.random.randint(0,3,(1000000)))
    
    # @Psidom's soln
    In [605]: %timeit ser.groupby(ser).cumcount()
    10 loops, best of 3: 30.1 ms per loop
    
    In [606]: %timeit array_cumcount(ser.values)
    100 loops, best of 3: 11.7 ms per loop
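
The `%timeit` lines above are IPython magics. Outside IPython, a rough equivalent could be written with the standard `timeit` module. The sketch below assumes a 10000-element input; the call count `n_calls` and the repeat count are arbitrary choices made here, and the numbers it prints will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd


def array_cumcount(a):
    # Same logic as the function in this answer, repeated so the script runs on its own.
    idx = np.flatnonzero(a[1:] != a[:-1]) + 1
    shift_arr = np.ones(a.size, dtype=int)
    shift_arr[0] = 0
    if len(idx) >= 1:
        shift_arr[idx[0]] = -idx[0] + 1
        shift_arr[idx[1:]] = -idx[1:] + idx[:-1] + 1
    return shift_arr.cumsum()


ser = pd.Series(np.random.randint(0, 3, 10000))

# Time each expression n_calls times per repeat and keep the best repeat.
n_calls = 100
t_groupby = min(timeit.repeat(lambda: ser.groupby(ser).cumcount(),
                              number=n_calls, repeat=3)) / n_calls
t_numpy = min(timeit.repeat(lambda: array_cumcount(ser.values),
                            number=n_calls, repeat=3)) / n_calls

print(f"groupby.cumcount: {t_groupby * 1e6:.1f} µs per call")
print(f"array_cumcount:   {t_numpy * 1e6:.1f} µs per call")
```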
    
  • 2021-01-21 05:46

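    You can use groupby.cumcount. Note that this groups by value, so it counts observations since the last change only when each value occurs in a single consecutive run, as in the example: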

    ser.groupby(ser).cumcount()
    
    #0    0
    #1    1
    #2    2
    #3    3
    #4    0
    #5    1
    #6    2
    #7    0
    #8    0
    #dtype: int64
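
If a value can come back after a change, grouping by the value itself keeps counting across the gap. One common workaround, sketched here with a hypothetical input that is not from the question, is to group by a run identifier built from `shift` and `cumsum`. The names `by_value`, `run_id` and `by_run` are made up for this illustration.

```python
import pandas as pd

# Hypothetical input in which the value 1 comes back after a change.
ser = pd.Series([1, 1, 2, 2, 1, 1])

# Grouping by value: the second run of 1s continues the first run's count.
by_value = ser.groupby(ser).cumcount()

# Grouping by run: a new id starts whenever the value differs from the previous one.
run_id = ser.ne(ser.shift()).cumsum()
by_run = ser.groupby(run_id).cumcount()

print(by_value.tolist())  # [0, 1, 0, 1, 2, 3]
print(by_run.tolist())    # [0, 1, 0, 1, 0, 1]
```

The run-based version resets at every change, which is the behaviour the question describes.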
    