How do I remove rows based on a range of dates given by values in 2 columns?

Question

I have a data set that includes a range of dates and need to fill in the missing dates in new rows. df1 is an example of the data I am working with and df2 is an example of what I've managed to achieve (where I'm stuck). df3 is where I would like to end up!

df1
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021  
2      1/23/2021  1/23/2021     1/24/2021

This is where I am with this. The NAs aren't an issue because I intend to drop the DateStart and DateEnd columns after doing what I need to do. The issue here is that I don't want to include the dates that fall within the previous DateStart and DateEnd range. To get here I grouped by ID and filled in the missing dates between the dates in df1:

df2
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/12/2021  NA            NA
1      2/13/2021  NA            NA
1      2/14/2021  NA            NA
1      2/15/2021  NA            NA
1      2/16/2021  NA            NA
1      2/17/2021  NA            NA
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/16/2021  NA            NA
2      1/17/2021  NA            NA
2      1/18/2021  NA            NA
2      1/19/2021  NA            NA
2      1/20/2021  NA            NA
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021
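
For reference, the fill step described above (group by ID, then fill in every date between the existing ones) might look roughly like the sketch below. It is only a guess at how df2 was produced: it assumes dplyr and tidyr, parses the dates with base as.Date, and uses the 1/23/2021 row for ID 2 so that it matches df2.

```r
library(dplyr)
library(tidyr)

# Sample input (df1), with the date strings parsed as month/day/year
df1 <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = as.Date(c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"), format = "%m/%d/%Y"),
  DateStart = as.Date(c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"), format = "%m/%d/%Y"),
  DateEnd   = as.Date(c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021"), format = "%m/%d/%Y")
)

# Within each ID, add one row for every day between the first and last Date;
# the added rows get NA in DateStart and DateEnd
df2 <- df1 %>%
  group_by(ID) %>%
  complete(Date = seq(min(Date), max(Date), by = "day")) %>%
  ungroup()
```

This fills every day, including the days that lie inside an existing DateStart-DateEnd range, which is the part that still has to be removed.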

This is actually what I'd like to end up with:

df3
ID     Date       DateStart     DateEnd
1      2/11/2021  2/11/2021     2/17/2021
1      2/18/2021  NA            NA
1      2/19/2021  2/19/2021     2/21/2021
2      1/15/2021  1/15/2021     1/20/2021
2      1/21/2021  NA            NA
2      1/22/2021  NA            NA    
2      1/23/2021  1/23/2021     1/24/2021

In df3 the missing dates are filled in but not the dates within the DateStart-DateEnd range.

Any thoughts on how to achieve this? Note: I have a dataset with a large number of observations.

Answer 1:

Convert the date columns to class Date (lubridate::mdy is used here, so lubridate must be installed).
For each ID, use complete to create a sequence of dates from the minimum of DateStart to the maximum of DateEnd.
fill the NA values with the previous non-NA value, then set DateStart and DateEnd back to NA wherever Date > DateEnd.
For every group of ID, DateStart and DateEnd, keep the rows with NA values plus the first row of the group.

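library(dplyr)
library(tidyr)

df %>%
  # parse every column except ID from month/day/year strings into Date
  mutate(across(-ID, lubridate::mdy)) %>%
  group_by(ID) %>%
  # one row per day from the earliest DateStart to the latest DateEnd
  complete(Date = seq(min(DateStart), max(DateEnd), by = '1 day')) %>%
  # carry each range forward onto the newly added rows
  fill(DateStart, DateEnd) %>%
  ungroup() %>%
  # days past the end of the carried range are gap days: blank the range
  mutate(across(c(DateStart, DateEnd), ~replace(., Date > DateEnd, NA))) %>%
  group_by(ID, DateStart, DateEnd) %>%
  # keep the gap days plus the first row of each range
  filter(is.na(DateStart) | row_number() == 1) %>%
  ungroup()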

#     ID Date       DateStart  DateEnd   
#  <int> <date>     <date>     <date>    
#1     1 2021-02-11 2021-02-11 2021-02-17
#2     1 2021-02-18 NA         NA        
#3     1 2021-02-19 2021-02-19 2021-02-21
#4     2 2021-01-15 2021-01-15 2021-01-20
#5     2 2021-01-21 NA         NA        
#6     2 2021-01-22 NA         NA        
#7     2 2021-01-23 2021-01-23 2021-01-24
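
Since the question mentions a large number of observations, a different route may be worth trying: build only the gap days instead of expanding every day and filtering most of them out again. The sketch below is not the approach used in the pipeline above. It assumes the ranges within an ID do not overlap, and the names ranges, gaps, gap_from, gap_to, n and k are made up for illustration. Dates are parsed with base as.Date, so lubridate is not needed.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID        = c(1L, 1L, 2L, 2L),
  Date      = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateStart = c("2/11/2021", "2/19/2021", "1/15/2021", "1/23/2021"),
  DateEnd   = c("2/17/2021", "2/21/2021", "1/20/2021", "1/24/2021")
)

# Parse the month/day/year strings and order the ranges within each ID
ranges <- df %>%
  mutate(across(-ID, ~ as.Date(.x, format = "%m/%d/%Y"))) %>%
  arrange(ID, DateStart)

# A gap runs from the day after one range ends to the day before the next starts
gaps <- ranges %>%
  group_by(ID) %>%
  mutate(gap_from = DateEnd + 1, gap_to = lead(DateStart) - 1) %>%
  ungroup() %>%
  filter(!is.na(gap_to), gap_from <= gap_to) %>%
  # repeat each gap once per day it covers; k numbers the copies 1..n
  mutate(n = as.integer(gap_to - gap_from) + 1L) %>%
  uncount(n, .id = "k") %>%
  mutate(Date = gap_from + k - 1) %>%
  select(ID, Date)

# Gap rows have no DateStart/DateEnd, so bind_rows fills those columns with NA
result <- bind_rows(ranges, gaps) %>%
  arrange(ID, Date)
```

On the sample data this should give the same seven rows as the output above. The number of rows created is the number of gap days rather than the number of days in the full span, which may matter when the ranges are long and the gaps are short.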

data

df <- structure(list(ID = c(1L, 1L, 2L, 2L), Date = c("2/11/2021", 
"2/19/2021", "1/15/2021", "1/23/2021"), DateStart = c("2/11/2021", 
"2/19/2021", "1/15/2021", "1/23/2021"), DateEnd = c("2/17/2021", 
"2/21/2021", "1/20/2021", "1/24/2021")), 
class = "data.frame", row.names = c(NA, -4L))

Source: https://stackoverflow.com/questions/66152303/how-do-i-remove-rows-based-on-a-range-of-dates-given-by-values-in-2-columns

Tags

date

range

tidyverse

fill