问题
I have a panel dataset: panel and a dataset with a list of events: Events. For the panel dataset, an equal panelID shows that two observations belong together.
panelID = c(1:50)
year= c(2001:2010)
country = c("NLD", "GRC", "GBR")
n <- 2
library(data.table)
set.seed(123)
Panel <- data.table(panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
norm = round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Educ = round(rnorm(10,0.75,0.3),2))
Panel[, uniqueID := .I] # Creates a unique ID
Panel[Panel == 0] <- NA
Events <- fread(
"Event_Type country year
A NLD 2005
A NLD 2004
A GBR 2006
A GBR 2003
A GRC 2002
A GRC 2007",
header = TRUE)
I want to know how often Events
happen between the panel obervations, split out per year. As an example, for the panel observations with panelID == 2
, in the country NLD there are two events, in or in between the years
of that panel observation, namely in 2004 and 2005. Hence:
DESIRED OUTPUT:
panleID country year 2002 2003 2004 2005 2006 2007
2 NLD 2004 NA NA 1 1 NA NA
2 NLD 2007 NA NA 1 1 NA NA
Based on the solution by Cole I tried to do the following:
# cast Event
Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
# dcast sorts the rhs alphabetically
cols <- sort(unique(Events[['year']]))
# non-equi update join
Panel[Events_cast,
on = .(country,
start <= year,
end >= year),
(cols) := mget(cols)]
#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]
Panel
But at the # non-equi update join
I get the error: Error in [.data.table (Panel, Events, on = .(country, : LHS of := appears to be column positions but are outside [1,ncol] range. New columns can only be added by name.
回答1:
data.table is trying to use the years to figure out which column you are selecting. The error is telling you 2006
and other years are not valid column numbers. The fix is easy:
cols <- as.character(sort(unique(Events[['year']])))
And here's everything together with a few other changes including:
- Using
data.table::dcast
instead ofreshape2::dcast
- Adding
start
andend
to theEvents
data.table and casting with those columns.
# cast Event
# Events_cast <- reshape2::dcast(Events, country + year ~ year, length, value.var="year")
Events[, `:=`(start = min(year), end = max(year)), by = country]
Events_cast <- dcast(Events, country + start + end~ year, length)
# update by reference for join later
Panel[, `:=`(start = min(year), end = max(year)), by = panelID]
# dcast sorts the rhs alphabetically
cols <- as.character(sort(unique(Events[['year']])))
# non-equi update join
# Panel[Events_cast,
# on = .(country,
# start <= year,
# end >= year),
# (cols) := mget(cols)]
Panel[Events_cast,
on = .(country,
start <= start,
end >= end),
(cols) := mget(cols)]
#clean up data frame
setnafill(Panel, fill = 0L, cols = cols)
Panel[, `:=`(start = NULL, end = NULL)]
Panel
来源:https://stackoverflow.com/questions/59047386/dcast-merge-based-on-a-column-with-a-value-within-a-certain-range