I have two large datasets, df1 and df2. The first dataset, df1, contains the columns 'ID' and 'actual.date'.
df1 <- data.frame(ID=c(1,1,1,2,3,4,4), a
You may use foverlaps from data.table. Convert both data.frames to data.tables with 'start'/'end' columns, and set the key of each data.table to all of its columns. Calling foverlaps with which=TRUE then returns a numeric index, which can be converted to a binary 'match' column based on the NA values in it.
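library(data.table)  # v1.9.5+
# Each date in df1 becomes a zero-length interval (start == end)
dt1 <- data.table(ID=df1$ID, start=df1$actual.date, end=df1$actual.date)
setkeyv(dt1, colnames(dt1))
dt2 <- as.data.table(df2)
setnames(dt2, 2:3, c('start', 'end'))
setkeyv(dt2, colnames(dt2))
# Index of the first matching interval in dt2 for each row of dt1, NA if none
indx <- foverlaps(dt1, dt2, type='within', which=TRUE, mult='first')
# NA -> 0, any index -> 1; then drop the helper 'end' column
dt1[, match:= +(!is.na(indx))][,end:=NULL]
setnames(dt1, 1:2, colnames(df1))
dt1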
# ID actual.date match
#1: 1 1997-10-01 0
#2: 1 1998-02-01 1
#3: 1 2002-05-01 1
#4: 2 1999-07-01 0
#5: 3 2005-09-01 1
#6: 4 2003-02-03 1
#7: 4 2006-05-01 0
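To make the "numeric index with NA" step more concrete, here is a minimal sketch on made-up data. The tables `pts` and `ivl` and their values are hypothetical, not the data from the question. It assumes a data.table version in which unmatched rows come back as NA (the default `nomatch`).

```r
library(data.table)

# Hypothetical toy data, not the data from the question
pts <- data.table(ID = c(1, 1, 2),
                  start = as.Date(c("2000-01-15", "2001-06-01", "2000-03-01")))
pts[, end := start]            # a single date is treated as a zero-length interval

ivl <- data.table(ID = c(1, 2),
                  start = as.Date(c("2000-01-01", "2000-04-01")),
                  end   = as.Date(c("2000-12-31", "2000-04-30")))
setkey(ivl, ID, start, end)    # the last two key columns define the interval

# One entry per row of pts: row number of the first containing interval in ivl, or NA
idx <- foverlaps(pts, ivl, type = "within", which = TRUE, mult = "first")

# Turn the index into a 0/1 flag: NA means "no interval contains this date"
pts[, match := +(!is.na(idx))]
pts
```

Only the first point falls inside an interval with the same ID, so it should be the only row flagged.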
Here is a solution with dplyr
library(dplyr)
dat <- inner_join(df1, df2, by = "ID")
dat %>% rowwise() %>%
  mutate(match = ifelse(between(actual.date, before.date, after.date), 1, 0)) %>%
  ungroup() %>%
  select(-c(before.date, after.date)) %>%
  arrange(actual.date, desc(match)) %>%
  # keep the best row per ID and date; .keep_all retains the match column
  distinct(ID, actual.date, .keep_all = TRUE)
The output is slightly different because it is ordered by actual.date. If that is a problem, I'll delete my solution.
Source: local data frame [7 x 3]
ID actual.date match
1 1 1997-10-01 0
2 1 1998-02-01 1
3 2 1999-07-01 0
4 1 2002-05-01 1
5 4 2003-02-03 1
6 3 2005-09-01 1
7 4 2006-05-01 0
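If the reordering matters, the same inclusive "between" test can be written so that df1 keeps its original row order. This is only a base R sketch on made-up data (the values below are hypothetical, not the data from the question). It loops over rows, so it is meant as an illustration rather than something to run on large datasets.

```r
# Hypothetical toy data, not the data from the question
df1 <- data.frame(ID = c(1, 1, 2),
                  actual.date = as.Date(c("2000-01-15", "2001-06-01", "2000-03-01")))
df2 <- data.frame(ID = c(1, 1, 2),
                  before.date = as.Date(c("1999-01-01", "2000-01-01", "2000-04-01")),
                  after.date  = as.Date(c("1999-06-30", "2000-12-31", "2000-04-30")))

# For each row of df1, check whether any df2 interval with the same ID contains the date
df1$match <- vapply(seq_len(nrow(df1)), function(i) {
  same_id <- df2$ID == df1$ID[i]
  inside  <- df1$actual.date[i] >= df2$before.date[same_id] &
             df1$actual.date[i] <= df2$after.date[same_id]
  as.integer(any(inside))      # any() of an empty vector is FALSE, so unmatched IDs get 0
}, integer(1))

df1
```

Unlike the inner join, this also keeps IDs from df1 that have no rows in df2 at all; they simply get a 0.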
Just another, hopefully correct, answer using the fuzzyjoin package.
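library(data.table)
library(fuzzyjoin)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
# Left join on equal ID and actual.date strictly between before.date and after.date
res <- fuzzy_left_join(dt1, dt2,
  by = c("ID" = "ID", "actual.date" = "before.date", "actual.date" = "after.date"),
  match_fun = list(`==`, `>`, `<`))
# Rows of dt1 without a matching interval have NA in ID.y
as.data.table(res)[, .(ID = ID.x, actual.date, match = ifelse(is.na(ID.y), 0, 1))]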
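One thing that may be worth checking: the join above compares with strict `>` and `<`, whereas dplyr's between() is inclusive on both ends. A date that falls exactly on before.date or after.date would therefore be treated differently. The small base R sketch below (the values are hypothetical) shows the difference. Passing `>=` and `<=` in match_fun would presumably give the inclusive behaviour.

```r
# Hypothetical values chosen so the date sits exactly on the lower bound
d  <- as.Date("2000-01-01")
lo <- as.Date("2000-01-01")
hi <- as.Date("2000-12-31")

strict    <- d > lo & d < hi      # the test that `>` and `<` perform
inclusive <- d >= lo & d <= hi    # closed-interval test, as in between()

c(strict = strict, inclusive = inclusive)
```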