问题
I would like to find the overlapping dates for each ID and create a new row with the overlapping dates and also combine the characters (char) for the lines. It is possible that my data will have >2 overlaps and need >2 combinations of characters. eg. ERM
Data:
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
Output I would like:
ID date1 date2 char
15 2003-04-05 2003-04-20 E
15 2003-04-20 2003-05-06 ER
15 2003-05-06 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2005-04-15 I
17 2005-04-15 2007-02-22 IC
17 2005-04-15 2007-05-15 C
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2014-05-19 2014-05-19 P
17 2010-06-07 2012-02-14 MV
17 2010-09-22 2011-02-14 VP
17 2012-02-28 2013-03-04 RP
What I have tried: I have tried using subtracting date 2 from the current row from the row below using:
df$diff <- c(NA,df[2:nrow(tdf), "date1"] - df[1:(nrow(df)-1), "date2"])
Then to determine the overlaps between the rows:
df$overlap[which(df$diff<1)] <-1
df$overlap.up <- c(df$overlap[2:(nrow(df))], "NA")
df$overlap.final[which(df$overlap==1 | df$overlap.up==1)] <- 1
I then selected those that had an overlap.final==1 and put them into another dataframe and found the overlaps for each ID.
However, I have realized that this is way too simplistic and flawed, because it only selects overlaps that occur sequentially (using the difference in dates in the first step). What I need to do is to take the series of dates for each ID and loop through each combination to determine if there is an overlap and then, if so, record that start and end date and create a new character “char” signalling what was combined during those two dates. I think I need a loop to do this.
I tried to create a loop to find the overlap intervals between date1 and date 2
df <- df[which(!duplicated(df$ ID)),]
for (i in 1:nrow(df)) {
tmp <- length(which(df $ID[i] & (df$date1[i] >df$date1 & df$date1[i]< df$date2) | (df$date2[i] < df$date2& df$date2[i]> df$date1))) >0
df$int[i]<- tmp
}
However this does not work.
After identifying the overlapping intervals, I need to create new rows for each new start and end date and a new character that represents the overlap.
Another version of the loop I have tried to identify overlaps:
for (i in 1:nrow(df)) {
if (df$ID[i]==IDs$ID){
tmp <- length(df, df$ ID[i]==IDs$ & (df$date1[i]> df$date1 & df$date1 [i]< df$date2 | df$date2[i] < df$date2 & df$date2[i]> df$date1)) >0
df$int[i]<- tmp
}
}
回答1:
First, we create a data.table
of all possible intervals for each ID
.
All possible intervals means that we take all start and end dates of an ID
and combine them in a sorted vector tmp
. The unique values indicate all possible intersections (or breaks) of all given intervals of the ID
on the time axis. For later joining, the breaks are re-arranged in one interval per row with a start
and an end
column:
library(data.table)
options(datatable.print.class = TRUE)
breaks <- DT[, {
tmp <- unique(sort(c(date1, date2)))
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}, by = ID]
breaks
ID start end <int> <IDat> <IDat> 1: 15 2003-04-05 2003-04-20 2: 15 2003-04-20 2003-05-06 3: 15 2003-05-06 2003-06-20 4: 16 2001-01-02 2002-03-04 5: 17 2003-03-05 2005-04-15 6: 17 2005-04-15 2007-02-22 7: 17 2007-02-22 2007-05-15 8: 17 2007-05-15 2008-02-05 9: 17 2008-02-05 2010-06-07 10: 17 2010-06-07 2010-09-22 11: 17 2010-09-22 2011-02-14 12: 17 2011-02-14 2012-02-14 13: 17 2012-02-14 2012-02-28 14: 17 2012-02-28 2013-03-04 15: 17 2013-03-04 2014-05-19
Then, a non-equi join is performed whereby the values are aggregated simultaneously on the join conditions (by = .EACHI
is called grouping by each i, see this answer for a more detailed explanation):
DT[breaks, on = .(ID, date1 <= start, date2 >= end), paste(char, collapse = ""),
by = .EACHI, allow.cartesian = TRUE]
ID date1 date2 V1 <int> <IDat> <IDat> <char> 1: 15 2003-04-05 2003-04-20 E 2: 15 2003-04-20 2003-05-06 ER 3: 15 2003-05-06 2003-06-20 R 4: 16 2001-01-02 2002-03-04 M 5: 17 2003-03-05 2005-04-15 I 6: 17 2005-04-15 2007-02-22 IC 7: 17 2007-02-22 2007-05-15 C 8: 17 2007-05-15 2008-02-05 CI 9: 17 2008-02-05 2010-06-07 CM 10: 17 2010-06-07 2010-09-22 CMV 11: 17 2010-09-22 2011-02-14 CMVP 12: 17 2011-02-14 2012-02-14 CMP 13: 17 2012-02-14 2012-02-28 CP 14: 17 2012-02-28 2013-03-04 CPR 15: 17 2013-03-04 2014-05-19 CP
The result differs from the expected result posted by the OP but plotting the data convinces that the above result shows all possible overlaps:
library(ggplot2)
ggplot(DT) + aes(y = char, yend = char, x = date1, xend = date2) +
geom_segment() + facet_wrap("ID", ncol = 1L)
Data
library(data.table)
DT <- fread(
"ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R"
)
cols <- c("date1", "date2")
DT[, (cols) := lapply(.SD, as.IDate), .SDcols = cols]
回答2:
Introduction
The for
-loop you added to your question and the included comparison were a good start. The should be some additional brackets (
and )
in the date comparison. This for
-loop-approach automatically considers new rows in the data frame. Therefore, you can get three-, four- and more-character strings in the char
column.
Create input data
df = as.data.frame(list('ID'=c(15, 15, 16, 17, 17, 17, 17, 17, 17, 17),
'date1'=as.Date(c('2003-04-05', '2003-04-20', '2001-01-02', '2003-03-05', '2005-04-15', '2007-05-15', '2008-02-05', '2010-06-07', '2010-09-22', '2012-02-28')),
'date2'=as.Date(c('2003-05-06', '2003-06-20', '2002-03-04', '2007-02-22', '2014-05-19', '2008-02-05', '2012-02-14', '2011-02-14', '2014-05-19', '2013-03-04')),
'char'=c('E', 'R', 'M', 'I', 'C', 'I', 'M', 'V', 'P', 'R')),
stringsAsFactors=FALSE)
Solution
Iterate all rows (that were existing in the original data.frame) and compare them to all current lines.
nrow_init = nrow(df)
for (i in 1:(nrow(df)-1)) {
print(i)
## get rows of df that have overlapping dates
## (1:nrow(df))>i :: consider only rows below the current row to avoid double processing of two row-pairs
## (!grepl(df$char[i],df$char)) :: prevent double letters
## Because we call nrow(df) each time (and not save it as a variable once in the beginning), we consider also new rows here. Therefore, we do not need the specific procedure for comparing 3 or more rows.
loc = ((1:nrow(df))>i) & (!grepl(df$char[i],df$char)) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date1>df$date1[i]) & (df$date1<df$date2[i])) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)) | ((df$date2<df$date2[i]) & (df$date2>df$date1[i])))
## Uncomment this line, if you want to compare only two rows each and not more
# loc = ((1:nrow(df))<=nrow_init) & ((1:nrow(df))>i) & (df$ID[i]==df$ID) & (((df$date1[i]>df$date1) & (df$date1[i]<df$date2)) | ((df$date2[i]<df$date2) & (df$date2[i]>df$date1)))
## proceed only of at least one duplicate row was found
if (sum(loc) > 0) {
# build new rows
# pmax and pmin do element-wise min and max calculation; df$date1[i] and df$date2[i] are automatically extended to the length of df$date1[loc] and df$date2[loc], respectively
df_append = as.data.frame(list('ID'=df$ID[loc],
'date1'=pmax(df$date1[i],df$date1[loc]),
'date2'=pmin(df$date2[i],df$date2[loc]),
'char'=paste0(df$char[i],df$char[loc])))
## append new rows
df = rbind(df, df_append)
}
}
## create a new column and sort the characters in it
## idea for sort: https://stackoverflow.com/a/5904854/4612235
df$sort_char = df$char
for (i in 1:nrow(df)) df$sort_char[i] = paste(sort(unlist(strsplit(df$sort_char[i], ""))), collapse = "")
## remove duplicates
df = df[!duplicated(df[c('ID', 'date1', 'date2', 'sort_char')]),]
## remove additional column
df$sort_char = NULL
Out put
ID date1 date2 char
15 2003-04-05 2003-05-06 E
15 2003-04-20 2003-06-20 R
16 2001-01-02 2002-03-04 M
17 2003-03-05 2007-02-22 I
17 2005-04-15 2014-05-19 C
17 2007-05-15 2008-02-05 I
17 2008-02-05 2012-02-14 M
17 2010-06-07 2011-02-14 V
17 2010-09-22 2014-05-19 P
17 2012-02-28 2013-03-04 R
15 2003-04-20 2003-05-06 ER
17 2005-04-15 2007-02-22 IC
17 2007-05-15 2008-02-05 CI
17 2008-02-05 2012-02-14 CM
17 2010-06-07 2011-02-14 CV
17 2010-09-22 2014-05-19 CP
17 2012-02-28 2013-03-04 CR
17 2010-06-07 2011-02-14 MV
17 2010-09-22 2012-02-14 MP
17 2010-06-07 2011-02-14 MCV
17 2010-09-22 2012-02-14 MCP
17 2010-09-22 2011-02-14 VP
17 2010-09-22 2011-02-14 VCP
17 2010-09-22 2011-02-14 VMP
17 2010-09-22 2011-02-14 VMCP
17 2012-02-28 2013-03-04 PR
17 2012-02-28 2013-03-04 PCR
来源:https://stackoverflow.com/questions/46151452/find-overlapping-dates-for-each-id-and-create-a-new-row-for-the-overlap