Find which interval row in a data frame each element of a vector belongs to

一整个雨季 2020-11-29 06:29

I have a vector of numeric elements, and a dataframe with two columns that define the start and end points of intervals. Each row in the dataframe is one interval. I want to

7 answers
  • 2020-11-29 06:40

    Inspired by @thelatemail's cut solution, here is one using findInterval which still requires a lot of typing:

    out <- findInterval(elements, t(intervals[c("start","end")]), left.open = TRUE)
    out[!(out %% 2)] <- NA
    intervals$phase[out %/% 2L + 1L]
    #[1] "a" "a" "a" NA  "b" "b" "c"
    

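    Caveat: cut (by default) and findInterval (with left.open = TRUE, as used here) treat intervals as left-open, so an element exactly equal to a start value is not matched. Therefore, solutions using cut and findInterval are not equivalent to Ben's using intrval, David's non-equi join using data.table, and my other solution using foverlaps.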
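
    To make the caveat concrete, here is a minimal base R sketch that compares the left-open findInterval approach with a closed-interval comparison on boundary values. The `intervals` data frame mirrors the example used in this thread; the `boundary` vector and the helper names are assumptions added for illustration only.

```r
# Example intervals, same values as used elsewhere in this thread
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# Hypothetical test values sitting exactly on interval boundaries
boundary <- c(0, 1, 1.9)

# Left-open approach: interleave start/end into one sorted break vector
breaks <- c(rbind(intervals$start, intervals$end))
out <- findInterval(boundary, breaks, left.open = TRUE)
out[!(out %% 2)] <- NA                      # even bins are gaps (or out of range)
left_open <- intervals$phase[out %/% 2L + 1L]

# Closed approach: start <= x <= end
closed <- vapply(boundary, function(x) {
  hit <- intervals$phase[x >= intervals$start & x <= intervals$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(data.frame(boundary, left_open, closed))
```

    The two columns differ only where an element coincides with a start value, which is exactly the case the caveat describes.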

  • 2020-11-29 06:42

    For completeness' sake, here is another way, using the intervals package:

    library(tidyverse)
    elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
    
    # tribble() replaces the deprecated frame_data()
    intervalsDF <- 
      tribble(  ~phase, ~start, ~end,
                "a",     0,      0.5,
                "b",     1,      1.9,
                "c",     2,      2.5
      )
    
    library(intervals)
    library(rlist)
    
    # For each interval, find the indices of the elements it contains;
    # tibble() replaces the deprecated data_frame()
    interval_overlap(
      Intervals(intervalsDF %>% select(-phase) %>% as.matrix, closed = c(TRUE, TRUE)),
      Intervals(tibble(start = elements, end = elements), closed = c(TRUE, TRUE))
    ) %>% 
      list.map(tibble(interval_index = .i, element_index = .)) %>% 
      do.call(what = bind_rows)
    
    # A tibble: 6 × 2
    #  interval_index element_index
    #           <int>         <int>
    #1              1             1
    #2              1             2
    #3              1             3
    #4              2             5
    #5              2             6
    #6              3             7
    
  • 2020-11-29 06:44

    Here's a possible solution using the new "non-equi" joins in data.table (v>=1.9.8). While I doubt you'll like the syntax, it should be a very efficient solution.

    Also, regarding findInterval, this function assumes continuity in your intervals, while this isn't the case here, so I doubt there is a straightforward solution using it.

    library(data.table) #v1.10.0
    setDT(intervals)[data.table(elements), on = .(start <= elements, end >= elements)]
    #    phase start end
    # 1:     a   0.1 0.1
    # 2:     a   0.2 0.2
    # 3:     a   0.5 0.5
    # 4:    NA   0.9 0.9
    # 5:     b   1.1 1.1
    # 6:     b   1.9 1.9
    # 7:     c   2.1 2.1
    

    Regarding the above code, I find it pretty self-explanatory: join intervals and elements by the condition specified in the on argument. That's pretty much it.
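
    For readers who want to see the join condition spelled out without data.table, the following is a rough base R sketch of the same logic (start <= element and end >= element). It is not how data.table implements the join, and the names `inside`, `first_hit` and `matched` are assumptions introduced here.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# Logical matrix: row i, column j is TRUE when element i lies in interval j
inside <- outer(elements, intervals$start, ">=") &
  outer(elements, intervals$end, "<=")

# Index of the first matching interval per element (NA when there is none)
first_hit <- apply(inside, 1, function(row) which(row)[1])

matched <- data.frame(
  elements = elements,
  phase = intervals$phase[first_hit],
  stringsAsFactors = FALSE
)
print(matched)
```

    This builds a full elements-by-intervals matrix, so it is only meant to illustrate the condition; for large inputs the non-equi join above should scale far better.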

    There is one caveat here, though: start, end and elements should all be of the same type, so if one of them is integer, it should be converted to numeric first.
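
    A small sketch of the conversion step mentioned above, assuming a variant of the example data in which the start column happens to be stored as integer. Only the coercion is shown here; the join itself is unchanged.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)

# Hypothetical variant of the example data: start is integer, end is double
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0L, 1L, 2L),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# Coerce both boundary columns (and the elements) to double before joining
num_cols <- c("start", "end")
intervals[num_cols] <- lapply(intervals[num_cols], as.numeric)
elements <- as.numeric(elements)

# Inspect the storage types after conversion
print(vapply(intervals[num_cols], typeof, character(1)))
```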

  • 2020-11-29 06:51

    David Arenburg's mention of non-equi joins was very helpful for understanding what general kind of problem this is (thanks!). I can see now that it's not implemented for dplyr. Thanks to this answer, I see that there is a fuzzyjoin package that can do it in the same idiom. But it's barely any simpler than my map solution above (though more readable, in my view), and doesn't hold a candle to thelatemail's cut answer for brevity.

    For my example above, the fuzzyjoin solution would be

    library(fuzzyjoin)
    library(tidyverse)
    
    fuzzy_left_join(data.frame(elements), intervals, 
                    by = c("elements" = "start", "elements" = "end"), 
                    match_fun = list(`>=`, `<=`)) %>% 
      distinct()
    

    Which gives:

        elements phase start end
    1      0.1     a     0   0.5
    2      0.2     a     0   0.5
    3      0.5     a     0   0.5
    4      0.9  <NA>    NA    NA
    5      1.1     b     1   1.9
    6      1.9     b     1   1.9
    7      2.1     c     2   2.5
    
  • 2020-11-29 06:51

    Here is a kind of "one-liner" that (mis-)uses foverlaps from the data.table package, though David's non-equi join is still more concise:

    library(data.table) #v1.10.0
    foverlaps(data.table(start = elements, end = elements), 
              setDT(intervals, key = c("start", "end")))
    #   phase start end i.start i.end
    #1:     a     0 0.5     0.1   0.1
    #2:     a     0 0.5     0.2   0.2
    #3:     a     0 0.5     0.5   0.5
    #4:    NA    NA  NA     0.9   0.9
    #5:     b     1 1.9     1.1   1.1
    #6:     b     1 1.9     1.9   1.9
    #7:     c     2 2.5     2.1   2.1
    
  • 2020-11-29 06:55

    Just lapply works:

    l <- lapply(elements, function(x){
        intervals$phase[x >= intervals$start & x <= intervals$end]
    })
    
    str(l)
    ## List of 7
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr "a"
    ##  $ : chr(0) 
    ##  $ : chr "b"
    ##  $ : chr "b"
    ##  $ : chr "c"
    

    or in purrr, if you purrrfurrr,

    library(purrr)
    
    elements %>% 
        map(~intervals$phase[.x >= intervals$start & .x <= intervals$end]) %>% 
        # Clean up a bit. Shorter, but less readable: map_chr(~.x[1] %||% NA)
        # NA_character_ keeps the result strictly character for map_chr()
        map_chr(~ifelse(length(.x) == 0, NA_character_, .x))
    ## [1] "a" "a" "a" NA  "b" "b" "c"
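
    If a plain character vector is wanted without purrr, the same idea can be written with vapply in base R. This is a sketch under the assumption that intervals do not overlap, so taking the first match is enough; `phase_of` is a name introduced here for illustration.

```r
elements <- c(0.1, 0.2, 0.5, 0.9, 1.1, 1.9, 2.1)
intervals <- data.frame(
  phase = c("a", "b", "c"),
  start = c(0, 1, 2),
  end   = c(0.5, 1.9, 2.5),
  stringsAsFactors = FALSE
)

# vapply enforces exactly one character value per element
phase_of <- vapply(elements, function(x) {
  hit <- intervals$phase[x >= intervals$start & x <= intervals$end]
  if (length(hit) == 0) NA_character_ else hit[1]
}, character(1))

print(phase_of)
```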
    