Question
I have a data set that includes a range of dates and need to fill in the missing dates in new rows. df1 is an example of the data I am working with, and df2 is an example of what I've managed to achieve (where I'm stuck). df3 is where I would like to end up!
df1
ID Date DateStart DateEnd
1 2/11/2021 2/11/2021 2/17/2021
1 2/19/2021 2/19/2021 2/21/2021
2 1/15/2021 1/15/2021 1/20/2021
2 1/23/2021 1/23/2021 1/24/2021
This is where I am with this. The NAs aren't an issue because I intend to drop the DateStart and DateEnd columns after doing what I need to do. The issue here is that I don't want to include the dates that fall within the previous DateStart and DateEnd range.
To get here I grouped by ID and filled in the missing dates between the dates in df1:
df2
ID Date DateStart DateEnd
1 2/11/2021 2/11/2021 2/17/2021
1 2/12/2021 NA NA
1 2/13/2021 NA NA
1 2/14/2021 NA NA
1 2/15/2021 NA NA
1 2/16/2021 NA NA
1 2/17/2021 NA NA
1 2/18/2021 NA NA
1 2/19/2021 2/19/2021 2/21/2021
2 1/15/2021 1/15/2021 1/20/2021
2 1/16/2021 NA NA
2 1/17/2021 NA NA
2 1/18/2021 NA NA
2 1/19/2021 NA NA
2 1/20/2021 NA NA
2 1/21/2021 NA NA
2 1/22/2021 NA NA
2 1/23/2021 1/23/2021 1/24/2021
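The code that produced df2 is not shown in the question. A minimal sketch of the grouped fill described above, assuming dplyr and tidyr and using the values from the data block at the end of the answer, might look like this:

```r
library(dplyr)
library(tidyr)

# Sample input; values copied from the data block in the answer
df1 <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  Date = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

df2 <- df1 %>%
  # Parse the month/day/year strings into Date objects
  mutate(across(-ID, ~ as.Date(.x, format = "%m/%d/%Y"))) %>%
  group_by(ID) %>%
  # Add one row per calendar day between the first and last Date of each ID;
  # DateStart and DateEnd are NA on the newly created rows
  complete(Date = seq(min(Date), max(Date), by = "1 day")) %>%
  ungroup()
```

This fills every day between consecutive Date values, which is why the days already covered by a DateStart-DateEnd range show up as extra NA rows.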
This is actually what I'd like to end up with:
df3
ID Date DateStart DateEnd
1 2/11/2021 2/11/2021 2/17/2021
1 2/18/2021 NA NA
1 2/19/2021 2/19/2021 2/21/2021
2 1/15/2021 1/15/2021 1/20/2021
2 1/21/2021 NA NA
2 1/22/2021 NA NA
2 1/23/2021 1/23/2021 1/24/2021
In df3 the missing dates are filled in, but not the dates within the DateStart-DateEnd range.
Any thoughts on how to achieve this? Note: I have a dataset with a large number of observations.
Answer 1:
Convert the date columns to Date class.
For each ID, use complete to create a sequence of dates from the minimum of DateStart to the maximum of DateEnd.
fill the NA values with the previous non-NA value, except where Date > DateEnd.
For every group of ID, DateStart and DateEnd, keep the rows with NA values or row number 1 in each group.
library(dplyr)
library(tidyr)
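df %>%
# parse the month/day/year strings in every column except ID
mutate(across(-ID, lubridate::mdy)) %>%
group_by(ID) %>%
# one row per day from the earliest DateStart to the latest DateEnd
complete(Date = seq(min(DateStart), max(DateEnd), by = '1 day')) %>%
# carry the last known range down onto the new rows
fill(DateStart, DateEnd) %>%
ungroup() %>%
# rows past the end of the carried range are gaps: reset them to NA
mutate(across(c(DateStart, DateEnd), ~replace(., Date > DateEnd, NA))) %>%
group_by(ID, DateStart, DateEnd) %>%
# keep all gap rows, plus the first row of each range
filter(is.na(DateStart) | row_number() == 1)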
# ID Date DateStart DateEnd
# <int> <date> <date> <date>
#1 1 2021-02-11 2021-02-11 2021-02-17
#2 1 2021-02-18 NA NA
#3 1 2021-02-19 2021-02-19 2021-02-21
#4 2 2021-01-15 2021-01-15 2021-01-20
#5 2 2021-01-21 NA NA
#6 2 2021-01-22 NA NA
#7 2 2021-01-23 2021-01-23 2021-01-24
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L), Date = c("2/11/2021",
"2/19/2021", "1/15/2021", "1/23/2021"), DateStart = c("2/11/2021",
"2/19/2021", "1/15/2021", "1/23/2021"), DateEnd = c("2/17/2021",
"2/21/2021", "1/20/2021", "1/24/2021")),
class = "data.frame", row.names = c(NA, -4L))
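Since the question mentions a large number of observations, one alternative worth considering is to generate only the gap dates rather than expanding every day and filtering afterwards. This is a sketch and not part of the original answer. It assumes the ranges within each ID do not overlap and that Date equals DateStart on every input row (both hold for the sample data). The names gap_from, gap_to, n, dated, gap_rows and result are made up for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c(1L, 1L, 2L, 2L),
  Date = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

# Parse the month/day/year strings into Date objects
dated <- df %>%
  mutate(across(-ID, ~ as.Date(.x, format = "%m/%d/%Y")))

gap_rows <- dated %>%
  arrange(ID, DateStart) %>%
  group_by(ID) %>%
  # A gap runs from the day after this range ends
  # to the day before the next range of the same ID starts
  mutate(gap_from = DateEnd + 1, gap_to = lead(DateStart) - 1) %>%
  ungroup() %>%
  # Drop the last range of each ID and ranges that touch with no gap
  filter(!is.na(gap_to), gap_from <= gap_to) %>%
  # Repeat each gap once per missing day
  mutate(n = as.integer(gap_to - gap_from) + 1L) %>%
  uncount(n) %>%
  group_by(ID, gap_from) %>%
  # Number the repeats to turn them into consecutive dates
  mutate(Date = gap_from + row_number() - 1L) %>%
  ungroup() %>%
  select(ID, Date)

# Gap rows have no DateStart/DateEnd, so bind_rows fills those with NA
result <- bind_rows(dated, gap_rows) %>%
  arrange(ID, Date)
```

On the sample data, result has the same seven rows as df3. The intermediate table only ever holds the gap days, which may matter when the ranges are long.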
Source: https://stackoverflow.com/questions/66152303/how-do-i-remove-rows-based-on-a-range-of-dates-given-by-values-in-2-columns