R - Using data.table to efficiently test rolling conditions across multiple rows and columns

Submitted by 孤人 on 2019-12-06 09:39:07

For your first question:

This also gets the running sum for years that are not necessarily in the dataset (as you requested just underneath the two points). The idea is to first generate all combinations of event and year, even the ones that don't exist in the dataset. This can be accomplished with the function CJ (for cross join): for each event, it creates a row for every year between the smallest and largest year in dt.

setkey(dt, event, year)
d1 = CJ(event=unique(dt$event), year=min(dt$year):max(dt$year))
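
As a rough illustration of what CJ produces, here is a minimal sketch on a made-up table. The name toy and its values are hypothetical and are not the asker's dt:

```r
library(data.table)

# Hypothetical toy data: event "A" has no row for 2001, event "B" only has 2001
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(2000L, 2002L, 2001L),
                  V1    = c(1L, 2L, 3L))

# Every event paired with every year in the observed range
grid <- CJ(event = unique(toy$event),
           year  = min(toy$year):max(toy$year))

nrow(grid)  # 2 events x 3 years = 6 rows
```

The grid contains pairs such as ("A", 2001) that never occur in toy, which is exactly what the later rolling window needs.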

Now, we join back with dt to fill the missing values for V1 with NA.

d1 = dt[d1]
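
To see how the join introduces NA for absent combinations, here is a small sketch under the assumption of a made-up table (toy, grid and filled are hypothetical names):

```r
library(data.table)

# Hypothetical toy data, keyed the same way as dt in the answer
toy <- data.table(event = c("A", "A", "B"),
                  year  = c(2000L, 2002L, 2001L),
                  V1    = c(1L, 2L, 3L))
setkey(toy, event, year)

grid <- CJ(event = unique(toy$event),
           year  = min(toy$year):max(toy$year))

# One result row per row of grid; V1 is NA where toy has no match
filled <- toy[grid]
filled$V1  # 1 NA 2 NA 3 NA
```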

Now we have a dataset with all combinations of event and year. Next, we need a way to perform the rolling sum. For this, we create yet another dataset, which lists, for each year, all of the previous 10 years:

window_size = 10L
d2 = d1[, list(window = seq(year-window_size, year-1L, by=1L)), by="event,year"]

For each "event,year" group, the new column window holds the previous 10 years, that is, year - 10 through year - 1.
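
A minimal sketch of this expansion, using a smaller window so the result is easy to read (the table d and the window size of 3 are assumptions made for illustration):

```r
library(data.table)

window_size <- 3L  # hypothetical; the answer uses 10L

# Hypothetical input: one event, two years
d <- data.table(event = c("A", "A"), year = c(2000L, 2001L))

# One output row per (event, year, preceding year)
w <- d[, list(window = seq(year - window_size, year - 1L, by = 1L)),
       by = "event,year"]

w[year == 2001L, window]  # 1998 1999 2000
```

Each input row is expanded into window_size rows, so the result here has 2 x 3 = 6 rows.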

Now all we have to do is set the key columns appropriately and perform a join to get the corresponding "V1" values.

setkey(d2, event, window) ## note the join here is on "event, window"
setkey(d1, event, year)

ans = d1[d2]
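
The name that the join gives to the clashing year column can differ between data.table versions, so here is a hedged variant that sidesteps the clash by renaming before joining. It is a sketch on made-up data, not the answer's exact code: target_year is a hypothetical name, and the on= argument assumes a data.table release that supports it.

```r
library(data.table)

# Hypothetical toy data; V1 is NA for 1999
d1 <- data.table(event = "A", year = 1998:2001, V1 = c(5L, NA, 7L, 1L))
window_size <- 2L

d2 <- d1[, list(window = seq(year - window_size, year - 1L, by = 1L)),
         by = "event,year"]

# Rename so the year being summarised and the window year have distinct names
setnames(d2, "year", "target_year")
setnames(d2, "window", "year")

# Look up V1 for every (event, window year) pair
ans <- d1[d2, on = c("event", "year")]

ans[target_year == 2001L, V1]  # V1 for 1999 and 2000: NA 7
```

With this naming, the later aggregations would group by event and target_year instead of event and year.1.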

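Now we have the values of "V1" for each "event,window" combination. All that remains is to aggregate by "event,year.1" ("year.1" was previously "year", and "year" in ans was previously "window"). The name "year.1" depends on the data.table version; newer releases may call this column "i.year", so check names(ans) if the code below fails. Here, we also take care of the condition that if any of the years in the window is < 1980, then the sum should be NA. This is done with a small trick: in R, TRUE | NA is TRUE and FALSE | NA is NA, so multiplying the sum by (!any(year < 1980) | NA) either leaves it unchanged or turns it into NA.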

q1 = ans[, sum(V1, na.rm=TRUE) * (!any(year < 1980) | NA), by="event,year.1"]

q1[event == "K" & year.1 == 2005]
#    event year.1 V1
# 1:     K   2005 25
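
The NA trick used in the aggregation can be checked in isolation with base R. In this sketch, mask_sum is a hypothetical helper written only for illustration:

```r
TRUE | NA   # TRUE
FALSE | NA  # NA

# Hypothetical helper: keep `total` unless any year in the window is before 1980
mask_sum <- function(total, years) {
  total * (!any(years < 1980) | NA)
}

mask_sum(25, 1995:2004)  # 25, all years are >= 1980
mask_sum(25, 1975:1984)  # NA, the window reaches back before 1980
```

Multiplying by TRUE leaves the sum unchanged, while multiplying by NA propagates the NA.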

For your second question:

Repeat the same steps as above with window_size = 15L instead of 10L, up to and including the computation of ans. Then, we can do:

q2 = ans[!is.na(V1)][, .N, by="event,year.1"]

q2[event == "A" & year.1 == 1997]
#    event year.1  N
# 1:     A   1997 14

This is correct: the 15-year window for 1997 covers 1982-1996. dt has all years from 1982 to 1995, while 1996 is missing and therefore not counted, so N = 14, as it should be.
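
The count can be sanity-checked with plain arithmetic. This sketch assumes, as stated above, that event "A" has a non-missing V1 for every year from 1982 to 1995 and nothing for 1996 (the variable names are hypothetical):

```r
# Years for which event "A" has data, per the description above
years_present <- 1982:1995

# The 15 years preceding 1997
window_years <- seq(1997L - 15L, 1997L - 1L)  # 1982 to 1996

# Number of window years that actually have data
n_counted <- sum(window_years %in% years_present)
n_counted  # 14
```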
