Find matching intervals in data frame by range of two column values

前端 未结 1 1439
南旧
南旧 2021-01-11 19:16

I have a data frame of time related events.

Here is an example:

Name     Event Order     Sequence     start_event     end_event     duration     Gr         


        
1条回答
  •  北荒
    北荒 (楼主)
    2021-01-11 19:30

    As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:

    overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
    do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
      jpos <- which(x$Name == "JOHN")
      njpos <- which(x$Name != "JOHN")
      over <- outer(jpos, njpos, function(a, b) {
        overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
      })
      x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
    }))
    #      Name EventOrder Sequence start_event end_event duration Group
    # A.2  JOHN          2        A          60       112       52   ID1
    # A.3  JOHN          3        A         392       429       37   ID1
    # A.7  ADAM          1        A          19        75       56   ID2
    # A.8  ADAM          2        A         384       407       23   ID2
    # C.5  JOHN          5        C         147       226       79   ID1
    # C.6  JOHN          6        C         566       611       45   ID1
    # C.11 ADAM          5        C         140       205       65   ID2
    # C.12 ADAM          6        C         522       599       77   ID2
    

    Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].

    This could be pretty easily mapped into dplyr language:

    library(dplyr)
    overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
    sliceRows <- function(name, start, end) {
      jpos <- which(name == "JOHN")
      njpos <- which(name != "JOHN")
      over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
      c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
    }
    dat %>%
      group_by(Sequence) %>%
      slice(sliceRows(Name, start_event, end_event))
    # Source: local data frame [8 x 7]
    # Groups: Sequence [3]
    # 
    #     Name EventOrder Sequence start_event end_event duration  Group
    #   (fctr)      (int)   (fctr)       (int)     (int)    (int) (fctr)
    # 1   JOHN          2        A          60       112       52    ID1
    # 2   JOHN          3        A         392       429       37    ID1
    # 3   ADAM          1        A          19        75       56    ID2
    # 4   ADAM          2        A         384       407       23    ID2
    # 5   JOHN          5        C         147       226       79    ID1
    # 6   JOHN          6        C         566       611       45    ID1
    # 7   ADAM          5        C         140       205       65    ID2
    # 8   ADAM          6        C         522       599       77    ID2
    

    If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:

    overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
    pair.overlap <- function(dat, user1, user2) {
      dat <- dat[dat$Name %in% c(user1, user2),]
      do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
        jpos <- which(x$Name == user1)
        njpos <- which(x$Name == user2)
        over <- outer(jpos, njpos, function(a, b) {
          overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
        })
        x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
      }))
    }
    

    You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:

    apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))
    

    0 讨论(0)
提交回复
热议问题