Lagged lists in data.table R [duplicate]

Submitted by 帅比萌擦擦* on 2019-12-12 17:40:22

Question


shift in R's data.table is great for time series and time-window work. But list columns don't shift the same way that columns of atomic values do. In the code below, carbLag lags carb correctly, but carbListLag doesn't shift carbList across rows. Instead, shift operates within each element of carbList, shifting the values against themselves inside the same row.

library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
###  Make a column of lists: the distinct carb values in each (cyl, gear) group
dt[, carbList := list(list(unique(carb))), by = .(cyl, gear)]
###  Now I want to lag/lead the column of lists
dt[, .(carb, carbLag = shift(carb),
       carbList, carbListLag = shift(carbList, type = "lead")), by = cyl]

    cyl carb carbLag carbList carbListLag
 1:   6    4      NA         4           NA
 2:   6    4       4         4           NA
 3:   6    1       4         1           NA <-- should be 4 here, not NA
 4:   6    1       1         1           NA
 5:   6    4       1         4           NA
 6:   6    4       4         4           NA
 7:   6    6       4         6           NA
 8:   4    1      NA       1,2         2,NA
 9:   4    2       1       1,2         2,NA
10:   4    2       2       1,2         2,NA
11:   4    1       2       1,2         2,NA
12:   4    2       1       1,2         2,NA
13:   4    1       2       1,2         2,NA
14:   4    1       1         1           NA <-- should be (1,2) here, not NA
15:   4    1       1       1,2         2,NA
16:   4    2       1         2           NA
17:   4    2       2         2           NA
18:   4    2       2       1,2         2,NA
19:   8    2      NA     2,4,3      4, 3,NA
20:   8    4       2     2,4,3      4, 3,NA
21:   8    3       4     2,4,3      4, 3,NA
22:   8    3       3     2,4,3      4, 3,NA
23:   8    3       3     2,4,3      4, 3,NA

Any suggestions to lag on lists the same way I lag on other elements?


Answer 1:


This is documented behavior. Here's part of the example at ?shift:

# on lists
ll = list(1:3, letters[4:1], runif(2))
shift(ll, 1, type="lead")
# [[1]]
# [1]  2  3 NA
# 
# [[2]]
# [1] "c" "b" "a" NA 
# 
# [[3]]
# [1] 0.1190792        NA

To get around this, you can make a unique ID for each distinct value in the list column:

dt[, carbList_id := match(carbList, unique(carbList))]

carbList_map = dt[, .(carbList = list(carbList[[1]])), by=carbList_id]

#    carbList_id carbList
# 1:           1        4
# 2:           2      1,2
# 3:           3        1
# 4:           4    2,4,3
# 5:           5        2
# 6:           6      4,8
# 7:           7        6

# or stick with long-form:
carbList_map = dt[, .(carb = carbList[[1]]), by=carbList_id]

#     carbList_id carb
#  1:           1    4
#  2:           2    1
#  3:           2    2
#  4:           3    1
#  5:           4    2
#  6:           4    4
#  7:           4    3
#  8:           5    2
#  9:           6    4
# 10:           6    8
# 11:           7    6

Then shift the new ID column, or apply whatever other operation you need to it. When you need the carbList values again, merge with the lookup table.
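The ID approach above can be carried through to a lagged list column. The following is a minimal sketch, not part of the original answer: the column names carbList_id_lag and carbListLag are made up for illustration, and the lookup uses match against the map table rather than an explicit merge.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbList := list(list(unique(carb))), by = .(cyl, gear)]

# One integer ID per distinct list value, plus a lookup table (as in the answer)
dt[, carbList_id := match(carbList, unique(carbList))]
carbList_map <- dt[, .(carbList = list(carbList[[1]])), by = carbList_id]

# The ID column is atomic, so shift() lags it row by row within each cyl group
# (carbList_id_lag is a hypothetical name)
dt[, carbList_id_lag := shift(carbList_id), by = cyl]

# Look the lagged IDs up in the map to recover the list values.
# An NA ID (first row of a group) yields a NULL element.
# The outer list() marks the value as a single list column.
dt[, carbListLag := list(
  carbList_map$carbList[match(carbList_id_lag, carbList_map$carbList_id)]
)]
```

Rows that have no previous row in their cyl group end up with NULL in carbListLag, which plays the role that NA plays in an atomic column.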

Alternatively, if you only need to browse the values rather than work with them, consider storing them as a string instead, e.g. carbList:=toString(sort(unique(carb))), or build the string with paste0 and its collapse argument.

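Side note: sort the values before calling toString, paste0 or list, so that the same set of values always gives the same result regardless of the order in which they appear.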
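A short sketch of the string alternative, assuming the values are only needed for display. The column names carbStr and carbStrLag are illustrative and do not come from the original answer.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]

# Store the distinct carb values of each (cyl, gear) group as one sorted string
dt[, carbStr := toString(sort(unique(carb))), by = .(cyl, gear)]

# A character column is atomic, so shift() lags it row by row within each group
dt[, carbStrLag := shift(carbStr), by = cyl]
```

The trade-off is that the individual values are no longer directly available; getting them back would mean parsing the string, for example with strsplit.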




Answer 2:


User Frank notes that shift doesn't lag list columns by row. Here is a solution with for and set that uses data.table to calculate the correct row indices for within-group lagging, but does all the other work in the for loop. Apart from minor optimizations, is this the best (clean and fast) that I can hope for within data.table?

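library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbsList := list(list(unique(carb))), by = .(cyl, gear)]
# Row index of the previous row within each cyl group (NA for the first row)
dt[, ':='(rowLag = shift(.I), gearLag = shift(gear)), by = cyl]
# Create the list column that the loop below fills in
dt[, ':='(carbsListLag = list())]
cl_j <- which(names(dt) == "carbsListLag")
# For each row, copy the carbsList value of its lagged row in place
for (i in seq_len(nrow(dt))) {
   set(dt, i, cl_j, dt[dt[i, rowLag], list(carbsList)])
}
dt[, .(carb, gear, gearLag, carbsList, carbsListLag, .I, rowLag), by = cyl]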
    cyl carb gear gearLag carbsList carbsListLag  I rowLag
 1:   6    4    4      NA         4         NULL  1     NA
 2:   6    4    4       4         4            4  2      1
 3:   6    1    3       4         1            4  4      2
 4:   6    1    3       3         1            1  6      4
...
13:   4    1    4       4       1,2          1,2 20     19
14:   4    1    3       4         1          1,2 21     20
15:   4    1    4       3       1,2            1 26     21
16:   4    2    5       4         2          1,2 27     26
17:   4    2    5       5         2            2 28     27
18:   4    2    4       5       1,2            2 32     28
19:   8    2    3      NA     2,4,3         NULL  5     NA
20:   8    4    3       3     2,4,3        2,4,3  7      5
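One possible loop-free variant of the same idea is sketched below. It is not from the original thread and has not been benchmarked. It reuses the rowLag index and relies on the base R behavior that subsetting a list with an NA index returns a NULL element.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbsList := list(list(unique(carb))), by = .(cyl, gear)]

# Row number of the previous row within each cyl group (NA for the first row)
dt[, rowLag := shift(.I), by = cyl]

# Index the whole list column once; an NA index yields a NULL element.
# The outer list() tells := that the value is a single list column.
dt[, carbsListLag := list(carbsList[rowLag])]
```

This should fill carbsListLag with the same values as the for loop above, including NULL for the first row of each group.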


Source: https://stackoverflow.com/questions/36040542/lagged-lists-in-data-table-r
