Identify gaps in time data

不打扰是莪最后的温柔 提交于 2019-12-24 09:47:46

问题


EDIT:

I found a way for the problem below, however, it works on a small dataset but still creates falses output on large datasets. Someone knows why? I can't find the mistake. Here's the code:

df$continuous <-
  unlist(lapply(split(df, df$ID),
                function(x) {
                  sapply(1:nrow(x),
                         function(y) {
                           any(x$start[y] - x$end[-(y:NROW(x$end))] <= 1)
                         })
                }))

ORIGINAL PROBLEM: I'm working on a function to identify a gap in a series of start/end dates. The output should be FALSE if a start date begins later than 1 day after any of the previous end dates.

DATA:

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

This is my attempt to solve this with x = start and y = end:

my_fun <- function(x,y){
  any(x[i] - y[1:NROW(i)-1] <= 1)
}

It works well if I specify i but I don't manage to wrap this into a loop. Ultimately, this function should be applied to groups in a large dataset in a dplyr manner.

This is what it should look like:

  ID      start        end  continuous
1  1 2010-01-01 2010-01-03 FALSE #or TRUE
2  1 2010-01-03 2010-01-22 TRUE
3  1 2010-01-05 2010-01-07 TRUE
4  1 2010-01-09 2010-01-12 TRUE
5  1 2010-02-01 2010-02-10 FALSE
6  1 2010-02-10 2010-02-12 TRUE #according to my function or FALSE compared to start[1] would be even better

I'd very much appreciate some help with this.


回答1:


You can do this using dplyr and lubridate. dplyr has really useful window functions like lag() that are handy for this type of analysis.

library(tidyverse)
library(lubridate)

df %>% 
  mutate(start - lag(end, 1) == 0)

# ID      start        end start - lag(end, 1) == 0
# 1  1 2010-01-01 2010-01-03                       NA
# 2  1 2010-01-03 2010-01-22                     TRUE
# 3  1 2010-01-05 2010-01-07                    FALSE
# 4  1 2010-01-09 2010-01-12                    FALSE
# 5  1 2010-02-01 2010-02-10                    FALSE
# 6  1 2010-02-10 2010-02-12                     TRUE

How do you want to handle the first row of your data? Since there is no previous value, it shows NA. This is generally how you should handle situations like this but I can edit my answer if you'd like it to have a different value.




回答2:


EDIT: I just re-read your question... and you wanted to compare the start with all the end dates. Then try to use the function expand.grid

df <- data.frame('ID' = c('1','1','1','1','1','1'), 'start' = as.Date(c('2010-01-01', '2010-01-03', '2010-01-05', '2010-01-09','2010-02-01', '2010-02-10')),
                 'end' = as.Date(c('2010-01-03', '2010-01-22', '2010-01-07', '2010-01-12', '2010-02-10', '2010-02-12')))

df_expand <- as.data.table(expand.grid(df$start,df$end))

This creates all the combinations between start/end

names(df_expand) = c("start","end")
df_expand[,ID:= 1]
df_expand[,diff_days := difftime(end,start,unit = "days")]
df_expand[,acceptable := any(diff_days<1), by = start]

Here we define what is acceptable..

df = merge(df,df_expand,by = c("start","end","ID"),all.x=TRUE)

I thought.. we might want to keep the original data, so we merge it back and shrink the information that we have.

Wait.. can you provide a proper example of what you expect and what is needed? I just read your other comment and now it confuses me...



来源:https://stackoverflow.com/questions/45676209/identify-gaps-in-time-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!