R data.table grouping for lagged regression

后端 未结 3 1670
故里飘歌
故里飘歌 2020-12-05 11:59

table with data (its a data.table object) that looks like the following :

      date         stock_id logret
   1: 2011-01-01        1  0.001
   2: 2011-01-0         


        
相关标签:
3条回答
  • 2020-12-05 12:00

    Just some additional notes due to Alex's comment. The reason you have difficulties understanding what's going on here is that a lot of things are done within one line. So it's always a good idea to break things down.

    What do we actually want? We want a new column lagret and the syntax to add a new column in data.table is the following:

    DT[, lagret := xxx]
    

    where xxx has to be filled up with whatever you want to have in column lagret. So if we just want a new column that gives us the rows, we could just call

    DT[, lagret := seq(from=1, to=nrow(DT))]
    

    Here, we actually want the lagged value of logret, but we have to consider that there are many stocks in here. That's why we do a self-join, i.e. we join the data.table DT with itself by the columns stock_id and date, but since we want the previous value of each stock, we use date-1. Note that we have to set the keys first to do such a join:

    setkeyv(DT,c('stock_id','date'))
    DT[list(stock_id,date-1)]
        stock_id       date logret
     1:        1 2010-12-31     NA
     2:        1 2011-01-01  0.001
     3:        1 2011-01-02  0.003
     4:        1 2011-01-03  0.005
     5:        1 2011-01-04  0.007
     6:        1 2011-01-05  0.009
    ...
    

    As you can see, we now have what we want. logret is now lagged by one period. But we actually want that in a new column lagret in DT, so we just get that column by calling [[3L]] (this means nothing else then get me the third column) and name this new column lagret:

    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-04        1  0.007  0.005
     5: 2011-01-05        1  0.009  0.007
    ...
    

    This is already the correct solution. In this simple case, we do not need roll=TRUE because there are no gaps in the dates. However, in a more realistic example (as mentioned above, for instance when we have weekends), there might be gaps. So let's make such a realistic example by just deleting two days in the DT for the first stock:

    DT <- DT[-c(4, 5)]
    setkeyv(DT,c('stock_id','date'))
    DT[,lagret:=DT[list(stock_id,date-1),logret][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-06        1  0.011     NA
     5: 2011-01-01        2  0.013     NA
    ...
    

    As you can see, the problem is now that we don't have a value for the 6th of January. That's why we use roll=TRUE:

    DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
              date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-06        1  0.011  0.005
     5: 2011-01-01        2  0.013     NA
    ...
    

    Just have a look on the documentation on how roll=TRUE works exactly. In a nutshell: If it can't find the previous value (here logret for the 5th of January), it just takes the last available one (here from the 3rd of January).

    0 讨论(0)
  • 2020-12-05 12:08

    Update:

    In the current development version of data.table, v1.9.5, shift() is implemented #965, which takes two types at the moment type = "lag" (default) and type = "lead". See ?shift for more on usage.

    With this, we can simply do:

    # type="lag" may be omitted, as it is the default.
    require(data.table) ## 1.9.5+
    DT[, lagret := shift(logret, 1L, type="lag"), by=stock_id]
    #           date stock_id logret lagret
    #  1: 2011-01-01        1  0.001     NA
    #  2: 2011-01-02        1  0.003  0.001
    #  3: 2011-01-03        1  0.005  0.003
    #  4: 2011-01-04        1  0.007  0.005
    #  5: 2011-01-05        1  0.009  0.007
    #  6: 2011-01-06        1  0.011  0.009
    #  7: 2011-01-01        2  0.013     NA
    #  8: 2011-01-02        2  0.015  0.013
    #  9: 2011-01-03        2  0.017  0.015
    # 10: 2011-01-04        2  0.019  0.017
    # 11: 2011-01-05        2  0.021  0.019
    # 12: 2011-01-06        2  0.023  0.021
    # 13: 2011-01-01        3  0.025     NA
    # 14: 2011-01-02        3  0.027  0.025
    # 15: 2011-01-03        3  0.029  0.027
    # 16: 2011-01-04        3  0.031  0.029
    # 17: 2011-01-05        3  0.033  0.031
    # 18: 2011-01-06        3  0.035  0.033
    
    0 讨论(0)
  • 2020-12-05 12:08

    Thanks to Matthew Dowle's advice, I was able to use the following :

    DT[,lagret:=DT[list(stock_id,date-1),logret,roll=TRUE][[3L]]]
    

    Results are :

                 date stock_id logret lagret
     1: 2011-01-01        1  0.001     NA
     2: 2011-01-02        1  0.003  0.001
     3: 2011-01-03        1  0.005  0.003
     4: 2011-01-04        1  0.007  0.005
     5: 2011-01-05        1  0.009  0.007
     6: 2011-01-06        1  0.011  0.009
     7: 2011-01-01        2  0.013     NA
     8: 2011-01-02        2  0.015  0.013
     9: 2011-01-03        2  0.017  0.015
    10: 2011-01-04        2  0.019  0.017
    11: 2011-01-05        2  0.021  0.019
    12: 2011-01-06        2  0.023  0.021
    13: 2011-01-01        3  0.025     NA
    14: 2011-01-02        3  0.027  0.025
    15: 2011-01-03        3  0.029  0.027
    16: 2011-01-04        3  0.031  0.029
    17: 2011-01-05        3  0.033  0.031
    18: 2011-01-06        3  0.035  0.033
    
    0 讨论(0)
提交回复
热议问题