Computing np.diff in Pandas after using groupby leads to unexpected result

半阙折子戏 2020-11-30 03:32

I've got a dataframe, and I'm trying to append a column of sequential differences to it. I have found a method that I like a lot (and generalizes well for my use case).

2 Answers
  • 2020-11-30 03:37

    You can see that the Series .diff() method behaves differently from np.diff():

    In [11]: data.value.diff()  # Note the NaN
    Out[11]: 
    0         NaN
    1   -0.410069
    2    0.523736
    3   -0.114340
    4   -0.014955
    5   -0.090033
    6   -0.125686
    7    0.414622
    8   -0.319616
    Name: value, dtype: float64
    
    In [12]: np.diff(data.value.values)  # the values array of the column
    Out[12]: 
    array([-0.41006867,  0.52373625, -0.11434009, -0.01495459, -0.09003298,
           -0.12568619,  0.41462233, -0.31961629])
    
    In [13]: np.diff(data.value) # on the column (Series)
    Out[13]: 
    0   NaN
    1     0
    2     0
    3     0
    4     0
    5     0
    6     0
    7     0
    8   NaN
    Name: value, dtype: float64
    
    In [14]: np.diff(data.value.index)  # er... on the index
    Out[14]: Int64Index([8], dtype=int64)
    
    In [15]: np.diff(data.value.index.values)
    Out[15]: array([1, 1, 1, 1, 1, 1, 1, 1])
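
    To check this on a current install, here is a minimal sketch, assuming a recent pandas and NumPy. The toy Series `s` is made up for illustration and is not the question's data. The In [13] output above comes from the pandas version in use at the time; on recent versions np.diff applied to a Series should return a plain array that is one element shorter, with no index attached, so it still cannot be assigned back as a column without padding.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the question's dataframe.
s = pd.Series([1.0, 4.0, 9.0, 16.0])

# Series.diff keeps the index and puts NaN in the first slot.
series_diff = s.diff()

# np.diff on the underlying values returns n - 1 elements and no index.
array_diff = np.diff(s.to_numpy())

# np.diff applied to the Series itself also yields a plain ndarray here.
raw = np.diff(s)

# To line the NumPy result up with the original rows, pad it by hand.
aligned = pd.Series(np.r_[np.nan, array_diff], index=s.index)

print(series_diff)
print(array_diff)
```

    The padded `aligned` Series matches `s.diff()`, which is the work that the pandas method does for you.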
    
  • 2020-11-30 03:53

    Nice, easy-to-reproduce example! More questions should be like this!

    Just pass a lambda to transform (this is tantamount to passing a function object, e.g. np.diff or Series.diff, directly). So this is equivalent to data1/data2:

    In [32]: data3['diffs'] = data3.groupby('ticker')['value'].transform(Series.diff)
    
    In [34]: data3.sort_index(inplace=True)
    
    In [25]: data3
    Out[25]: 
             date    ticker     value     diffs
    0  2013-10-03  ticker_2  0.435995  0.015627
    1  2013-10-04  ticker_2  0.025926 -0.410069
    2  2013-10-02  ticker_1  0.549662       NaN
    3  2013-10-01  ticker_0  0.435322       NaN
    4  2013-10-02  ticker_2  0.420368  0.120713
    5  2013-10-03  ticker_0  0.330335 -0.288936
    6  2013-10-04  ticker_1  0.204649 -0.345014
    7  2013-10-02  ticker_0  0.619271  0.183949
    8  2013-10-01  ticker_2  0.299655       NaN
    
    [9 rows x 4 columns]
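
    For a self-contained version of the same idea, here is a minimal sketch assuming a recent pandas. The dataframe `df` and its values are hypothetical, chosen only so the result is easy to verify by hand. It sorts by date within each ticker first, then computes the per-group difference three ways that should agree: passing `pd.Series.diff`, passing a lambda, and calling the groupby's own `diff` method.

```python
import pandas as pd

# Hypothetical toy data with rows deliberately out of date order.
df = pd.DataFrame({
    "date": ["2013-10-02", "2013-10-01", "2013-10-01", "2013-10-02", "2013-10-03"],
    "ticker": ["a", "b", "a", "b", "a"],
    "value": [3.0, 10.0, 1.0, 14.0, 6.0],
})

# Sort so that differences within each ticker run in date order.
df = df.sort_values(["ticker", "date"])

grouped = df.groupby("ticker")["value"]

# Passing the unbound method and passing a lambda do the same thing.
df["diffs"] = grouped.transform(pd.Series.diff)
df["diffs_lambda"] = grouped.transform(lambda x: x.diff())

# The groupby object also exposes diff directly.
df["diffs_builtin"] = grouped.diff()

# Restore the original row order; results stay aligned through the index.
df = df.sort_index()
print(df)
```

    Because every result carries the original index, the assignment lines up with the right rows even though the frame was sorted in between.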
    

    I believe that np.diff doesn't follow numpy's own ufunc guidelines for processing array inputs (whereby it tries various methods to coerce input and send output, e.g. __array__ on input, __array_wrap__ on output). I am not really sure why, see a bit more info here. So the bottom line is that np.diff is not dealing with the index properly and is doing its own calculation (which in this case is wrong).

    Pandas has a lot of methods that don't just call the numpy function, mainly because they handle different dtypes, handle NaNs, and in this case, handle 'special' diffs, e.g. you can pass a time frequency to a datelike index and it calculates how many n to actually diff.
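
    As a small illustration of the dtype and NaN handling, here is a sketch assuming a recent pandas, with made-up values. It does not cover the frequency case mentioned above. It only shows that Series.diff on datetimes yields timedeltas with a missing first entry, and that the `periods` argument controls how far back each difference reaches.

```python
import pandas as pd

# Hypothetical datetime data with an uneven gap.
dates = pd.Series(pd.to_datetime(["2013-10-01", "2013-10-02", "2013-10-04"]))

# Differences of datetimes come back as timedeltas; the first entry is NaT.
date_diffs = dates.diff()

# periods=2 subtracts the value two rows earlier, leaving two missing entries.
values = pd.Series([1.0, 4.0, 9.0])
two_back = values.diff(periods=2)

print(date_diffs)
print(two_back)
```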
