Collapse and merge overlapping time intervals

大城市里の小女人 提交于 2021-01-21 09:43:59

问题


I am developing a tidyverse-based data workflow, and came across a situation where I have a data frame with lots of time intervals. Let's call the data frame my_time_intervals, and it can be reproduced like this:

library(tidyverse)
library(lubridate)

my_time_intervals <- tribble(
    ~id, ~group, ~start_time, ~end_time,
    1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
    2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
    3L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
    4L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
    5L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
    6L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
    7L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
    8L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

Here's a tibble view of the same data frame:

> my_time_intervals
# A tibble: 8 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     3     1 2018-05-07 13:02:04 2018-05-23 08:13:06
4     4     2 2018-02-28 17:43:29 2018-04-20 03:48:40
5     5     2 2018-04-20 01:19:52 2018-08-12 12:56:37
6     6     2 2018-04-18 20:47:22 2018-04-19 16:07:29
7     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
8     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

A few notes about my_time_intervals:

  1. The data is divided into three groups via the group variable.

  2. The id variable is just a unique ID for each row in the data frame.

  3. The start and end of time intervals are stored in start_time and end_time in lubridate form.

  4. Some time intervals overlap, some don't, and they are not always in order. For example, row 1 overlaps with row 3, but neither of them overlaps with row 2.

  5. More than two intervals may overlap with each other, and some intervals fall completely within others. See rows 4 through 6 in group == 2.

What I want is that within each group, collapse any overlapping time intervals into contiguous intervals. In this case, my desired result would look like:

# A tibble: 5 x 4
     id group start_time          end_time           
  <int> <int> <dttm>              <dttm>             
1     1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
2     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
3     4     2 2018-02-28 17:43:29 2018-08-12 12:56:37
4     7     2 2018-10-02 14:08:03 2018-11-08 00:01:23
5     8     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Notice that time intervals that overlap between different groups are not merged. Also, I don't care about what happens to the id column at this point.

I know that the lubridate package includes interval-related functions, but I can't figure out how to apply them to this use case.

How to I achieve this? Thank you very much.


回答1:


my_time_intervals %>% group_by(group) %>% arrange(start_time) %>% 
                      mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
                      group_by(group, indx) %>%
                      summarise(start_time = min(start_time), end_time = max(end_time)) %>%
                      select(-indx)


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset which has more overlapping times within each group so the solution would get more exposure and hopefully will be grasped better;

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am adding arrange by group column to see all the same grouped rows together; but, as you know because we have group_by(group) we do not actually need that.

my_time_intervals %>% group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


  # # A tibble: 13 x 5
  # # Groups:   group [3]
  # id group start_time          end_time             indx
  # <int> <int> <dttm>              <dttm>              <dbl>
  # 1     1      1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
  # 2     7      1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
  # 3     8      1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
  # 4     2      1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
  # 5     3      1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
  # 6     5      1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
  # 7     4      1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
  # 8     6      1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
  # 9     9      2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
  # 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
  # 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
  # 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
  # 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in the group one we have 3 distinct period of times with overlapping datapoints and one datapoint which has no overlapped entry within that group. The indx column divided those data points to 4 groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(indx,group) we get each of these overlapping ones together and we get the first starting time and last ending time to make the desired output.

Just to make the solution more prone to errors (in case we had a datapoint which was starting sooner but ending later than the whole other ones in one group (group and index) like what we have in the datapooints with the id of 6 and 7) I changed first() and last() to min() and max().

So...

my_time_intervals %>% group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each overlapping time and date to get the period (start and end) for each of them.

Beyond this point, you need to read about cumsum and cummax and also look at the output of these two functions for this specific problem to understand why the comparison that I made, ended up giving us unique identifiers for each of the overlapping time and dates.

Hope this helps, as it is my best.




回答2:


Another tidyverse method:

library(tidyverse)
library(lubridate)

my_time_intervals %>%
  arrange(group, start_time) %>%
  group_by(group) %>%
  mutate(new_end_time = if_else(end_time >= lead(start_time), lead(end_time), end_time),
         g = new_end_time != end_time | is.na(new_end_time),
         end_time = if_else(end_time != new_end_time & !is.na(new_end_time), new_end_time, end_time)) %>%
  filter(g) %>%
  select(-new_end_time, -g)



回答3:


We could sort by start_time, then nest and use reduce in subtables to merge rows when relevant (using Masoud's data) :

library(tidyverse)
df %>% 
  arrange(start_time) %>% # 
  select(-id) %>%
  nest(start_time, end_time,.key="startend") %>%
  mutate(startend = map(startend,~reduce(
    seq(nrow(.))[-1],
    ~ if(..3[.y,1] <= .x[nrow(.x),2]) 
        if(..3[.y,2] > .x[nrow(.x),2]) `[<-`(.x, nrow(.x), 2, value = ..3[.y,2])
        else .x
      else bind_rows(.x,..3[.y,]),
    .init = .[1,],
    .))) %>%
  arrange(group) %>%
  unnest()

# # A tibble: 7 x 3
# group          start_time            end_time
# <int>              <dttm>              <dttm>
# 1     1 2018-04-12 13:15:03 2018-05-23 10:13:06
# 2     1 2018-07-04 04:53:20 2018-07-14 20:09:01
# 3     1 2018-07-15 03:53:20 2018-07-19 20:09:01
# 4     1 2018-07-20 04:53:20 2018-07-22 20:09:01
# 5     2 2018-02-28 18:43:29 2018-08-12 14:56:37
# 6     2 2018-10-02 16:08:03 2018-11-08 01:01:23
# 7     3 2018-03-11 23:30:51 2018-10-20 23:01:42


来源:https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!