Rolling window function for irregular time series that can handle duplicates

Question

I have the following data.frame:

    grp  nr   yr
 1:   A 1.0 2009
 2:   A 2.0 2009
 3:   A 1.5 2009
 4:   A 1.0 2010
 5:   B 3.0 2009
 6:   B 2.0 2010
 7:   B  NA 2011
 8:   C 3.0 2014
 9:   C 3.0 2019
10:   C 3.0 2020
11:   C 4.0 2021

Desired output:

   grp  nr   yr nr_roll_period_3
1    A 1.0 2009               NA
2    A 2.0 2009               NA
3    A 1.5 2009               NA
4    A 1.0 2010               NA
5    B 3.0 2009               NA
6    B 2.0 2010               NA
7    B  NA 2011               NA
8    C 3.0 2014               NA
9    C 3.0 2019               NA
10   C 3.0 2020               NA
11   C 4.0 2021         3.333333

The logic:

I want to calculate a rolling mean over a period of length k (say 3) within each group, where the period includes the current year/month/day.
However, nothing should be calculated where there are not 3 consecutive years/months/days.
Likewise, whenever there is an NA in the calculation column within this period, the output should be NA.

Currently I have this function:

calculate_rolling_window <- function(dt, date_col, calc_col, id, k) {
  require(data.table)
  # For each date x, average calc_col over the group's rows dated in [x - k + 1, x]
  return(setDT(dt)[
    , paste(calc_col, "roll_period", k, sep = "_") :=
      sapply(get(date_col), function(x) mean(get(calc_col)[between(get(date_col), x - k + 1, x)])),
    by = mget(id)])
}

It works fine for regular cases, where there are no duplicates in the date column. With duplicates, however, it fails:

    grp  nr   yr nr_roll_period_3
 1:   A 1.0 2009         1.500000
 2:   A 2.0 2009         1.500000
 3:   A 1.5 2009         1.500000
 4:   A 1.0 2010         1.375000
 5:   B 3.0 2009               NA
 6:   B 2.0 2010               NA
 7:   B  NA 2011               NA
 8:   C 3.0 2014               NA
 9:   C 3.0 2019               NA
10:   C 3.0 2020               NA
11:   C 4.0 2021         3.333333

Any ideas on how to handle this? There is no need for an exclusively data.table approach.

Answer 1:

This can be solved in three steps: aggregating over a rolling window of length k by grouping in a non-equi join, filtering for windows that contain k consecutive years, and an update join:

library(data.table)
k <- 3L
# group by join parameters of a non-equi join
mDT <- setDT(DT)[.(grp = grp, upper = yr, lower = yr - k), 
                 on = .(grp, yr <= upper, yr > lower), 
                 .(uniqueN(x.yr), mean(nr)), by = .EACHI]
# update join with filtered intermediate result
DT[mDT[V1 == k], on = .(grp, yr), paste0("nr_roll_period_", k) := V2]
DT

which returns OP's expected result:

    grp  nr   yr nr_roll_period_3
 1:   A 1.0 2009               NA
 2:   A 2.0 2009               NA
 3:   A 1.5 2009               NA
 4:   A 1.0 2010               NA
 5:   B 3.0 2009               NA
 6:   B 2.0 2010               NA
 7:   B  NA 2011               NA
 8:   C 3.0 2014               NA
 9:   C 3.0 2019               NA
10:   C 3.0 2020               NA
11:   C 4.0 2021         3.333333

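The intermediate result mDT contains the rolling mean V2 over k periods and the count of distinct years V1 within each period. It is created by a non-equi join of DT with a table of upper and lower bounds, which is created on the fly by .(grp = grp, upper = yr, lower = yr - k). The lower bound is exclusive (yr > lower), so each window spans exactly k years. In the printed result, the first yr column holds the upper bound and the second yr column holds the lower bound.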
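To make the window bounds concrete, here is a minimal sketch that builds the bounds table explicitly and evaluates one window by hand. The names bounds, w, in_window, n_years and roll_mean are illustrative and do not appear in the answer; the data is retyped from the question.

```r
library(data.table)

# Sample data from the question, retyped as a data.table.
DT <- data.table(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009L, 2009L, 2009L, 2010L, 2009L, 2010L, 2011L,
          2014L, 2019L, 2020L, 2021L)
)
k <- 3L

# One window per row: years in (lower, upper] belong to the window.
bounds <- DT[, .(grp, upper = yr, lower = yr - k)]

# Window for row 4 (grp A, yr 2010), evaluated by hand.
w <- bounds[4]
in_window <- DT[grp == w$grp & yr <= w$upper & yr > w$lower]
n_years   <- uniqueN(in_window$yr)  # 2 distinct years (2009, 2010), fewer than k
roll_mean <- mean(in_window$nr)     # 1.375, the V2 value in row 4 of mDT
```

The non-equi join performs this lookup for every row at once, which is why it copes with duplicated years: all rows of a year fall into the window, but the year is counted only once.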

mDT

    grp   yr   yr V1       V2
 1:   A 2009 2006  1 1.500000
 2:   A 2009 2006  1 1.500000
 3:   A 2009 2006  1 1.500000
 4:   A 2010 2007  2 1.375000
 5:   B 2009 2006  1 3.000000
 6:   B 2010 2007  2 2.500000
 7:   B 2011 2008  3       NA
 8:   C 2014 2011  1 3.000000
 9:   C 2019 2016  1 3.000000
10:   C 2020 2017  2 3.000000
11:   C 2021 2018  3 3.333333

This is filtered for rows which contain exactly k distinct years. Because each window spans exactly k years, k distinct years means k consecutive years:

mDT[V1 == k]

   grp   yr   yr V1       V2
1:   B 2011 2008  3       NA
2:   C 2021 2018  3 3.333333

Finally, this is joined with DT to append the new column to DT.

Note that mean() returns NA by default if there is an NA in the input data, which is the behaviour the question asks for.
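Since the question does not insist on data.table, the same rules can be cross-checked with a small base R sketch. The helper roll_mean_consecutive is hypothetical (it is not part of the answer) and assumes integer periods without missing values. It loops over every row, so it is quadratic in group size and intended for verification rather than for large data.

```r
# Hypothetical helper: rolling mean over the last k periods of one group,
# NA unless all k periods are present in the window.
roll_mean_consecutive <- function(value, period, k) {
  vapply(seq_along(period), function(i) {
    # Rows whose period lies in (period[i] - k, period[i]].
    in_window <- period > period[i] - k & period <= period[i]
    if (length(unique(period[in_window])) < k) {
      return(NA_real_)
    }
    # No na.rm: an NA inside the window yields NA.
    mean(value[in_window])
  }, numeric(1))
}

# Sample data from the question.
df <- data.frame(
  grp = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"),
  nr  = c(1, 2, 1.5, 1, 3, 2, NA, 3, 3, 3, 4),
  yr  = c(2009L, 2009L, 2009L, 2010L, 2009L, 2010L, 2011L,
          2014L, 2019L, 2020L, 2021L)
)
k <- 3L
col <- paste0("nr_roll_period_", k)

# Apply the helper group by group, writing results back by row index.
df[[col]] <- NA_real_
for (idx in split(seq_len(nrow(df)), df$grp)) {
  df[[col]][idx] <- roll_mean_consecutive(df$nr[idx], df$yr[idx], k)
}
df
```

On the sample data only the last row (grp C, yr 2021) gets a value; group B has three consecutive years, but its window contains an NA.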

Data

library(data.table)
DT <- fread(text = "rn    grp  nr   yr
 1:   A 1.0 2009
 2:   A 2.0 2009
 3:   A 1.5 2009
 4:   A 1.0 2010
 5:   B 3.0 2009
 6:   B 2.0 2010
 7:   B  NA 2011
 8:   C 3.0 2014
 9:   C 3.0 2019
10:   C 3.0 2020
11:   C 4.0 2021", drop = 1L)

Source: https://stackoverflow.com/questions/52725965/rolling-window-function-for-irregular-time-series-that-can-handle-duplicates

Tags

date

time-series

rolling-computation