Pandas: Difference to previous value

伪装坚强ぢ 2021-01-18 11:26

Given a pandas DataFrame that looks like this:

GROUP   VALUE    MASK
  1        5     false
  2       10     false
  2       20     false
  1        7     true


        
2 Answers
  • 2021-01-18 11:53

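    Use groupby and diff to compute the difference to the previous VALUE within each GROUP, then filter the rows with the boolean MASK column: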

    pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()],
              axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
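
    As a minimal sketch of how this behaves, the snippet below runs the same expression on a small frame. The first four rows follow the table in the question; the last two rows are invented here for illustration only.

```python
import pandas as pd

# Hypothetical sample data: the first four rows come from the question,
# the last two are made up so that each group has more than two rows.
df = pd.DataFrame({
    "GROUP": [1, 2, 2, 1, 2, 1],
    "VALUE": [5, 10, 20, 7, 25, 8],
    "MASK":  [False, False, False, True, True, False],
})

# Difference to the previous VALUE within the same GROUP.
# The first row of every group has no predecessor and becomes NaN.
diff = df.groupby("GROUP")["VALUE"].diff()

# Put VALUE and DIFF side by side, then keep only the rows where MASK is True.
result = pd.concat([df["VALUE"], diff], axis=1, keys=["VALUE", "DIFF"])[df["MASK"]]
print(result)
```

    Only rows 3 and 4 survive the filter, and each DIFF is taken against the previous row of the same group, not the previous row of the frame.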
    

  • 2021-01-18 12:02

    The bottleneck here is groupby, and this specific problem does not need it. Instead, sort the dataframe by GROUP, apply diff to the sorted VALUE column, and filter by MASK. The sort must be stable, so use kind='mergesort' to keep the original row order within each group.

    This assumes that MASK is always False for the first element of each group. Without groupby, diff on the sorted column subtracts the last value of the previous group from the first value of the next one, so the difference on those rows is meaningless and has to be filtered out. Under that assumption you can use:

    pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
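
    A small sketch of where the two approaches agree and where they differ, on a made-up frame (the data is hypothetical, and sort_index() is added here only so the two Series can be compared row by row):

```python
import pandas as pd

# Hypothetical sample data; MASK is False on the first row of each group.
df = pd.DataFrame({
    "GROUP": [1, 2, 2, 1, 2, 1],
    "VALUE": [5, 10, 20, 7, 25, 8],
    "MASK":  [False, False, False, True, True, False],
})

# Stable sort by GROUP, then a plain diff over the whole sorted column.
sorted_diff = (
    df.sort_values(by="GROUP", kind="mergesort")["VALUE"]
    .diff()
    .sort_index()  # back to the original row order for comparison
)

# Reference result computed per group.
grouped_diff = df.groupby("GROUP")["VALUE"].diff()

# Rows where the two disagree (treating NaN == NaN as agreement).
both_nan = sorted_diff.isna() & grouped_diff.isna()
mismatch = ~(sorted_diff.eq(grouped_diff) | both_nan)

# First row of each group: the only place a cross-group diff can leak in.
first_in_group = ~df["GROUP"].duplicated()

result = pd.concat([df["VALUE"], sorted_diff], axis=1, keys=["VALUE", "DIFF"])[df["MASK"]]
print(mismatch[mismatch].index.tolist())
print(result)
```

    In this frame the only disagreement is on row 1, the first row of group 2, where the sorted diff picks up the last value of group 1. Because MASK is False there, the filtered result matches the groupby version.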
    

    Performance Tests:

    import numpy as np
    import pandas as pd

    MAXN = 200000
    GROUPS = 10000
    df = pd.DataFrame({"GROUP": np.ceil(np.random.rand(MAXN)*GROUPS),
                       "VALUE": np.ceil(np.random.rand(MAXN)*10000),
                       "MASK": np.floor(np.random.rand(MAXN)*2).astype("bool")})
    
    %timeit t1 = pd.concat([df.VALUE, df.groupby('GROUP').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
    # 1 loop, best of 3: 1.28 s per loop
    
    %timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
    #10 loops, best of 3: 63.1 ms per loop
    
    # Rerun with MAXN = 2000000 and GROUPS = 1000000:
    %timeit t2 = pd.concat([df.VALUE, df.sort_values(by="GROUP", kind='mergesort').VALUE.diff()], axis=1, keys=['VALUE', 'DIFF'])[df.MASK]
    #1 loop, best of 3: 1.24 s per loop
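
    The %timeit lines above need IPython. A rough sketch of the same comparison as a plain script, using the standard timeit module, might look like this. The smaller sizes, the fixed seed, and the helper names with_groupby and with_sort are assumptions made here, not part of the original test, and the timings will vary with the pandas version and the hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the sizes in the answer so the sketch finishes quickly.
MAXN = 20000
GROUPS = 1000

rng = np.random.default_rng(0)  # fixed seed, only for repeatability
df = pd.DataFrame({
    "GROUP": rng.integers(1, GROUPS + 1, size=MAXN),
    "VALUE": rng.integers(1, 10001, size=MAXN),
    "MASK": rng.random(MAXN) < 0.5,
})


def with_groupby(frame):
    """Per-group diff, filtered by MASK."""
    diff = frame.groupby("GROUP")["VALUE"].diff()
    return pd.concat([frame["VALUE"], diff], axis=1, keys=["VALUE", "DIFF"])[frame["MASK"]]


def with_sort(frame):
    """Stable sort by GROUP, plain diff, filtered by MASK."""
    diff = frame.sort_values(by="GROUP", kind="mergesort")["VALUE"].diff()
    return pd.concat([frame["VALUE"], diff], axis=1, keys=["VALUE", "DIFF"])[frame["MASK"]]


t_groupby = timeit.timeit(lambda: with_groupby(df), number=3) / 3
t_sort = timeit.timeit(lambda: with_sort(df), number=3) / 3
print(f"groupby: {t_groupby:.4f} s per call")
print(f"sort:    {t_sort:.4f} s per call")

# The random MASK ignores the "first row of each group is False" assumption,
# so compare the two results only on rows that are not first in their group.
a, b = with_groupby(df), with_sort(df)
not_first = df["GROUP"].duplicated()[df["MASK"]]
same = a[not_first].equals(b[not_first])
print("results agree on non-first rows:", same)
```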
    