Computing diffs within groups of a dataframe

你的背包 2020-11-30 19:21

Say I have a dataframe with 3 columns: Date, Ticker, Value (no index, at least to start with). I have many dates and many tickers, but each (ticker, date) tupl

6 Answers
  • 2020-11-30 19:39

    I know this is an old question, so I'm assuming this functionality didn't exist at the time. But for those with this question now, this solution works well:

    df.sort_values(['ticker', 'date'], inplace=True)
    df['diffs'] = df.groupby('ticker')['value'].diff()
    

    To return to the original order, you can then use

    df.sort_index(inplace=True)
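
    A minimal sketch of the steps above on a small made-up frame (the values are assumptions chosen for illustration, not data from the question). It sorts, diffs within each ticker, and then restores the original row order through the untouched integer index:

```python
import pandas as pd

# Hypothetical sample data; column names follow the answer above.
df = pd.DataFrame({
    "date":   [3, 1, 2, 2, 1, 3],
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "value":  [10.0, 4.0, 7.0, 1.0, 5.0, 2.0],
})

# Sort so each ticker's rows are in date order, then diff within each ticker.
df = df.sort_values(["ticker", "date"])
df["diffs"] = df.groupby("ticker")["value"].diff()

# Restore the original row order using the original integer index.
df = df.sort_index()
print(df)
```

    The earliest date of each ticker gets NaN because it has no predecessor inside its group.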
    
  • 2020-11-30 19:43

    Wouldn't it be easier to just do what you describe yourself, namely

    df.sort_values(['ticker', 'date'], inplace=True)
    df['diffs'] = df['value'].diff()
    

    and then correct for borders:

    mask = df.ticker != df.ticker.shift(1)
    df.loc[mask, 'diffs'] = np.nan
    

    To keep the original order, save idx = df.index at the beginning and call df = df.reindex(idx) at the end. If the dataframe is huge, perform the operations on

    df.filter(['ticker', 'date', 'value'])
    

    and then join the two dataframes at the end.
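
    The filter-and-join variant could look roughly like this. The frame and the extra 'note' column are made up for illustration; the boundary mask is written with .loc to avoid chained assignment:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'note' stands in for the extra columns of a large frame.
df = pd.DataFrame({
    "date":   [2, 1, 1, 2, 3],
    "ticker": ["B", "A", "B", "A", "A"],
    "value":  [5.0, 1.0, 2.0, 4.0, 9.0],
    "note":   ["v", "w", "x", "y", "z"],
})

# Work on a narrow copy that holds only the columns needed for the diff.
small = df.filter(["ticker", "date", "value"]).sort_values(["ticker", "date"])
small["diffs"] = small["value"].diff()

# The first row of each ticker has no predecessor within its own group.
mask = small["ticker"] != small["ticker"].shift(1)
small.loc[mask, "diffs"] = np.nan

# join aligns on the index, so the original row order of df is kept.
result = df.join(small[["diffs"]])
print(result)
```

    Because the join aligns on the index, no explicit reindex is needed in this version.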

    Edit: alternatively (though still not using groupby):

    df.set_index(['ticker','date'], inplace=True)
    df.sort_index(inplace=True)
    df['diffs'] = np.nan 
    
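    for idx in df.index.levels[0]:
        # select all rows of one ticker; .loc avoids chained assignment
        df.loc[(idx, slice(None)), 'diffs'] = df.loc[(idx, slice(None)), 'value'].diff()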
    

    For the input

       date ticker  value
    0    63      C   1.65
    1    88      C  -1.93
    2    22      C  -1.29
    3    76      A  -0.79
    4    72      B  -1.24
    5    34      A  -0.23
    6    92      B   2.43
    7    22      A   0.55
    8    32      A  -2.50
    9    59      B  -1.01
    

    this produces:

                 value  diffs
    ticker date              
    A      22     0.55    NaN
           32    -2.50  -3.05
           34    -0.23   2.27
           76    -0.79  -0.56
    B      59    -1.01    NaN
           72    -1.24  -0.23
           92     2.43   3.67
    C      22    -1.29    NaN
           63     1.65   2.94
           88    -1.93  -3.58
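
    As a cross-check, the table above can be reproduced from the sample input with a grouped diff on the first index level. This is only a sketch that swaps the loop for groupby(level=...); the rounding to two decimals is added here purely for display:

```python
import pandas as pd

# Sample data copied from the input table above.
df = pd.DataFrame({
    "date":   [63, 88, 22, 76, 72, 34, 92, 22, 32, 59],
    "ticker": list("CCCABABAAB"),
    "value":  [1.65, -1.93, -1.29, -0.79, -1.24, -0.23, 2.43, 0.55, -2.50, -1.01],
})

# Index by (ticker, date), sort, then diff within each ticker.
out = df.set_index(["ticker", "date"]).sort_index()
out["diffs"] = out.groupby(level="ticker")["value"].diff().round(2)
print(out)
```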
    
  • 2020-11-30 19:43
    # Make sure your data is sorted properly
    df = df.sort_values(by=['group_var', 'value'])
    
    # only take diffs where the previous row is in the same group
    df['diffs'] = np.where(df.group_var == df.group_var.shift(1), df.value.diff(), 0)
    

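    Explanation: shift(1) lines each row up with the group label of the row above it, so the difference is kept only where both rows belong to the same group. The first row of each group gets 0 rather than NaN.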
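
    A small sketch of this on made-up data (group labels and values are assumptions), compared against a grouped diff with the NaNs filled by 0:

```python
import numpy as np
import pandas as pd

# Hypothetical data, already sorted by group.
df = pd.DataFrame({
    "group_var": ["a", "a", "a", "b", "b"],
    "value":     [1.0, 3.0, 6.0, 10.0, 15.0],
})

# True where the row above belongs to the same group.
same_group = df["group_var"] == df["group_var"].shift(1)

# Keep the plain diff inside a group; use 0 at each group boundary.
df["diffs"] = np.where(same_group, df["value"].diff(), 0)

# Equivalent result via groupby, with boundary NaNs replaced by 0.
via_groupby = df.groupby("group_var")["value"].diff().fillna(0)
print(df)
```

    Note that using 0 at group boundaries makes a missing predecessor indistinguishable from a genuine zero change; passing np.nan as the last argument of np.where keeps them apart.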

  • 2020-11-30 19:44

    Ok. Lots of thinking about this, and I think this is my favorite combination of the solutions above and a bit of playing around. Original data lives in df:

    df.sort_values(['ticker', 'date'], inplace=True)
    
    # for this example, with diff, I think this syntax is a bit clunky
    # but for more general examples, this should be good.  But can we do better?
    df['diffs'] = df.groupby(['ticker'])['value'].transform(lambda x: x.diff()) 
    
    df.sort_index(inplace=True)
    

    This accomplishes everything I want. What I really like is that it generalizes to cases where you want to apply a function more intricate than diff. In particular, you could use something like lambda x: x.rolling(20, min_periods=20).mean() (written as pd.rolling_mean(x, 20, 20) in older pandas versions) to make a column of rolling means without worrying about each ticker's data being corrupted by that of any other ticker (groupby takes care of that for you).
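
    A sketch of the rolling-mean generalization on made-up data. A window of 2 is used instead of 20 only to keep the example small, and the 'roll' column name is an assumption:

```python
import pandas as pd

# Hypothetical data: two tickers, three dates each.
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "date":   [1, 2, 3, 1, 2, 3],
    "value":  [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

df = df.sort_values(["ticker", "date"])

# Rolling mean computed separately per ticker; windows never span two tickers.
df["roll"] = df.groupby("ticker")["value"].transform(
    lambda x: x.rolling(window=2, min_periods=2).mean()
)
print(df)
```

    The first row of ticker B is NaN rather than an average that mixes in the last value of ticker A.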

    So here's the question I'm left with...why doesn't the following work for the line that starts df['diffs']:

    df['diffs'] = df.groupby(['ticker'])['value'].transform(np.diff)
    

    When I do that, I get a diffs column full of 0's. Any thoughts on that?
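
    One relevant difference, offered as an observation rather than a diagnosis of the zeros reported here (which may depend on the pandas version): np.diff returns one element fewer than its input, while Series.diff keeps the length and puts NaN in the first position. transform expects a result it can line up with every row of the group. A minimal comparison:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 4.0, 9.0, 16.0])

# NumPy drops the first position: the result has len(x) - 1 elements.
by_numpy = np.diff(x.to_numpy())

# pandas keeps the length and marks the first position as missing.
by_pandas = x.diff()

print(len(by_numpy), len(by_pandas))
```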

  • 2020-11-30 19:55

    Here is a solution that builds on what @behzad.nouri wrote, but using pd.IndexSlice:

    df = df.set_index(['ticker', 'date']).sort_index()[['value']]
    df['diff'] = np.nan
    idx = pd.IndexSlice
    
    for ix in df.index.levels[0]:
        df.loc[idx[ix, :], 'diff'] = df.loc[idx[ix, :], 'value'].diff()
    

    For:

    > df
       date ticker  value
    0    63      C   1.65
    1    88      C  -1.93
    2    22      C  -1.29
    3    76      A  -0.79
    4    72      B  -1.24
    5    34      A  -0.23
    6    92      B   2.43
    7    22      A   0.55
    8    32      A  -2.50
    9    59      B  -1.01
    

    It returns:

    > df
                 value  diff
    ticker date             
    A      22     0.55   NaN
           32    -2.50 -3.05
           34    -0.23  2.27
           76    -0.79 -0.56
    B      59    -1.01   NaN
           72    -1.24 -0.23
           92     2.43  3.67
    C      22    -1.29   NaN
           63     1.65  2.94
           88    -1.93 -3.58
    
  • 2020-11-30 19:56

    You can use pivot to convert the dataframe into a date-by-ticker table. Here is an example.

    Create the test data first:

    import pandas as pd
    import numpy as np
    import random
    from itertools import product
    
    dates = pd.date_range(start="2013-12-01", periods=10).strftime("%Y-%m-%d")
    ticks = "ABCDEF"
    pairs = list(product(dates, ticks))
    random.shuffle(pairs)
    pairs = pairs[:-5]
    values = np.random.rand(len(pairs))
    
    dates, ticks = zip(*pairs)
    df = pd.DataFrame({"date":dates, "tick":ticks, "value":values})
    

    Convert the dataframe to the pivoted format:

    df2 = df.pivot(index="date", columns="tick", values="value")
    

    Forward-fill the NaN values:

    df2 = df2.ffill()
    

    Call the diff() method:

    df2.diff()
    

    Here is what df2 looks like after the forward fill:

    tick               A         B         C         D         E         F
    date                                                                  
    2013-12-01  0.077260  0.084008  0.711626  0.071267  0.811979  0.429552
    2013-12-02  0.106349  0.141972  0.457850  0.338869  0.721703  0.217295
    2013-12-03  0.330300  0.893997  0.648687  0.628502  0.543710  0.217295
    2013-12-04  0.640902  0.827559  0.243816  0.819218  0.543710  0.190338
    2013-12-05  0.263300  0.604084  0.655723  0.299913  0.756980  0.135087
    2013-12-06  0.278123  0.243264  0.907513  0.723819  0.506553  0.717509
    2013-12-07  0.960452  0.243264  0.357450  0.160799  0.506553  0.194619
    2013-12-08  0.670322  0.256874  0.637153  0.582727  0.628581  0.159636
    2013-12-09  0.226519  0.284157  0.388755  0.325461  0.957234  0.810376
    2013-12-10  0.958412  0.852611  0.472012  0.832173  0.957234  0.723234
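
    The same pipeline on a small deterministic frame (values made up for illustration) shows what the forward fill does to a missing (date, tick) pair:

```python
import pandas as pd

# Hypothetical data: tick B has no row for 2013-12-02.
df = pd.DataFrame({
    "date":  ["2013-12-01", "2013-12-01", "2013-12-02", "2013-12-03", "2013-12-03"],
    "tick":  ["A", "B", "A", "A", "B"],
    "value": [1.0, 10.0, 2.0, 4.0, 13.0],
})

# One row per date, one column per tick; the missing pair becomes NaN.
wide = df.pivot(index="date", columns="tick", values="value")

# Carry the last known value forward, then diff down each column.
filled = wide.ffill()
diffs = filled.diff()
print(diffs)
```

    With the forward fill, the missing date for B shows a diff of 0 and the next date shows the full change. Skipping the fill would instead leave NaN on both the missing date and the date after it.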
    