Question
shift in R's data.table is great for time series and time-window work. But list columns don't lag the same way that columns of other types do. In the code below, carbLag lags carb correctly, but carbListLag doesn't shift carbList across rows; instead, shift operates within each element of carbList, shifting that element's values on itself within the same row.
library(data.table)
dt <- data.table(mtcars)[, .(gear, carb, cyl)]
### Make a column of lists
dt[, carbList := list(list(unique(carb))), by = .(cyl, gear)]
### Now I want to lag/lead the column of lists
dt[, .(carb, carbLag = shift(carb),
       carbList, carbListLag = shift(carbList, type = "lead")), by = cyl]
    cyl carb carbLag carbList carbListLag
 1:   6    4      NA        4          NA
 2:   6    4       4        4          NA
 3:   6    1       4        1          NA  <-- should be 4 here, not NA
 4:   6    1       1        1          NA
 5:   6    4       1        4          NA
 6:   6    4       4        4          NA
 7:   6    6       4        6          NA
 8:   4    1      NA      1,2        2,NA
 9:   4    2       1      1,2        2,NA
10:   4    2       2      1,2        2,NA
11:   4    1       2      1,2        2,NA
12:   4    2       1      1,2        2,NA
13:   4    1       2      1,2        2,NA
14:   4    1       1        1          NA  <-- should be (1,2) here, not NA
15:   4    1       1      1,2        2,NA
16:   4    2       1        2          NA
17:   4    2       2        2          NA
18:   4    2       2      1,2        2,NA
19:   8    2      NA    2,4,3     4, 3,NA
20:   8    4       2    2,4,3     4, 3,NA
21:   8    3       4    2,4,3     4, 3,NA
22:   8    3       3    2,4,3     4, 3,NA
23:   8    3       3    2,4,3     4, 3,NA
Any suggestions to lag on lists the same way I lag on other elements?
Answer 1:
This is documented behavior. Here's part of the example at ?shift:
# on lists
ll = list(1:3, letters[4:1], runif(2))
shift(ll, 1, type="lead")
# [[1]]
# [1] 2 3 NA
#
# [[2]]
# [1] "c" "b" "a" NA
#
# [[3]]
# [1] 0.1190792 NA
To get around this, you can make a unique ID for each value of the list:
dt[, carbList_id := match(carbList, unique(carbList))]
carbList_map = dt[, .(carbList = list(carbList[[1]])), by=carbList_id]
#    carbList_id carbList
# 1:           1        4
# 2:           2      1,2
# 3:           3        1
# 4:           4    2,4,3
# 5:           5        2
# 6:           6      4,8
# 7:           7        6
# or stick with long-form:
carbList_map = dt[, .(carb = carbList[[1]]), by=carbList_id]
#     carbList_id carb
#  1:           1    4
#  2:           2    1
#  3:           2    2
#  4:           3    1
#  5:           4    2
#  6:           4    4
#  7:           4    3
#  8:           5    2
#  9:           6    4
# 10:           6    8
# 11:           7    6
Then apply shift (or any other operation) to the new ID column. When you need the values of carbList again, you'll have to merge with the mapping table.
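End to end, the ID-based workaround might look like the sketch below. It is only one possible reading of the steps above: the column names carbList_id_lag and carbListLag are made up for illustration, and the lookup uses match against the mapping table in place of a full merge. Rows with no previous row in their cyl group end up holding NULL.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbList := list(list(unique(carb))), by = .(cyl, gear)]

# Unique ID per distinct list value, plus a lookup table (as in the answer)
dt[, carbList_id := match(carbList, unique(carbList))]
carbList_map <- dt[, .(carbList = list(carbList[[1]])), by = carbList_id]

# The ID is a plain integer column, so shift lags it row by row within cyl
# (carbList_id_lag is a hypothetical column name)
dt[, carbList_id_lag := shift(carbList_id), by = cyl]

# Translate lagged IDs back into list values; an NA ID yields a NULL entry
dt[, carbListLag := carbList_map$carbList[
  match(carbList_id_lag, carbList_map$carbList_id)
]]

dt[, .(cyl, carb, carbList, carbListLag)]
```

Because the lagging itself happens on an atomic column, the same pattern should carry over to type = "lead" or to larger values of n.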
Alternatively, if you don't really need to work with the values but just want to browse them, consider making the column a string instead, like carbList:=toString(sort(unique(carb))) or with paste0.
Side note: sort before using toString, paste0 or list, so that groups containing the same values in a different order produce identical results.
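A minimal sketch of the string alternative, assuming the values are only needed for display. The column names carbStr and carbStrLag are hypothetical.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]

# One sorted, comma-separated string per (cyl, gear) group instead of a list
dt[, carbStr := toString(sort(unique(carb))), by = .(cyl, gear)]

# A character column is atomic, so shift lags it row by row within cyl
dt[, carbStrLag := shift(carbStr), by = cyl]

dt[, .(cyl, carb, carbStr, carbStrLag)]
```

The first row of each cyl group gets NA in carbStrLag, just as it would for a numeric column.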
Answer 2:
User Frank notes that shift doesn't support lists. Here is a solution with for and set that uses data.table to calculate the correct indices for within-group lagging but does all the other work in the for loop. Excepting minor optimizations, is this the best (clean+fast) that I can hope for within data.table?
library(data.table)
dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbsList := list(list(unique(carb))), by = .(cyl, gear)]
# Row index and gear of the previous row within each cyl group
dt[, ':='(rowLag = shift(.I), gearLag = shift(gear)), by = cyl]
# Empty list column to fill in row by row
dt[, ':='(carbsListLag = list())]
cl_j <- which(names(dt) == "carbsListLag")
for (i in seq_len(nrow(dt))) {
  set(dt, i, cl_j, dt[dt[i, rowLag], list(carbsList)])
}
dt[,.(carb, gear, gearLag, carbsList, carbsListLag, .I, rowLag), by=cyl]
    cyl carb gear gearLag carbsList carbsListLag  I rowLag
 1:   6    4    4      NA         4         NULL  1     NA
 2:   6    4    4       4         4            4  2      1
 3:   6    1    3       4         1            4  4      2
 4:   6    1    3       3         1            1  6      4
...
13:   4    1    4       4       1,2          1,2 20     19
14:   4    1    3       4         1          1,2 21     20
15:   4    1    4       3       1,2            1 26     21
16:   4    2    5       4         2          1,2 27     26
17:   4    2    5       5         2            2 28     27
18:   4    2    4       5       1,2            2 32     28
19:   8    2    3      NA     2,4,3         NULL  5     NA
20:   8    4    3       3     2,4,3        2,4,3  7      5
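One possible way to drop the explicit loop, offered as a sketch and not as a benchmarked answer to the "best" question: keep the within-group row index from shift(.I), then subset the list column once with all of those indices. This relies on base R returning NULL when a list is subset with NA, and on an ungrouped := accepting a list whose length equals the number of rows.

```r
library(data.table)

dt <- data.table(mtcars)[, .(gear, carb, cyl)]
dt[, carbsList := list(list(unique(carb))), by = .(cyl, gear)]

# Row number of the previous row within each cyl group (NA for the first row)
dt[, rowLag := shift(.I), by = cyl]

# Subset the list column once with every lagged row number;
# an NA index produces a NULL entry for that row
dt[, carbsListLag := carbsList[rowLag]]

dt[, .(carb, gear, carbsList, carbsListLag, .I, rowLag), by = cyl]
```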
Source: https://stackoverflow.com/questions/36040542/lagged-lists-in-data-table-r