R dplyr window function: get the first value in the next x rows that fulfils some condition

Question

I have a data frame that I work on with dplyr, and I have a condition. For each row, I want to know the index of the first row among the next x rows that matches the condition.

In my case, I want an additional column that holds the index of the first value that is larger than the current value by at least z.

Example: here we are looking for the index of the first value in the next 3 rows that is larger than the current value by at least 3. For the first row, the value is 0, and the first value in the next 3 rows that is larger by at least 3 is in row 4, where the value is 3.

In the third row, the value is 2, and none of the next 3 rows matches the condition, so the result is NA.

  value index_of_matched_cell
1     0                       4
2     0                       4
3     2                      NA
4     3                       7
5     3                       7
6     3                       7
7     6                      NA
8     6                      NA
9     6                      NA

Thank you!

Answer 1:

Here is one way, using rollapply from zoo:

# Example data from the question
df <- data.frame(value = c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L))

next_rows <- 3
larger_than <- 3

with(df, zoo::rollapply(seq_along(value), next_rows + 1, function(x) 
               x[which(value[x] >= (value[x[1]] + larger_than))[1]],
               align = 'left', fill = NA))

#[1]  4  4 NA  7  7  7 NA NA NA

In rollapply, we iterate over the row indices with a window size of next_rows + 1 (we want to consider the next 3 rows, and the window also includes the current row). Within each window, we compare the current value with the next 3 values and return the index of the first one that is greater than or equal to the current value plus larger_than. Because fill = NA pads the positions where a full window is not available, the last next_rows rows are always NA.
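To make the window logic concrete, here is a minimal base R sketch that emulates the same left-aligned window without zoo. The helper name emulate_left_window is hypothetical and not part of zoo; it only mirrors the width of next_rows + 1 and the fill = NA padding of the call above.

```r
# Hypothetical helper (not part of zoo): emulates a left-aligned rolling
# window of width next_rows + 1, padding incomplete windows with NA.
emulate_left_window <- function(value, next_rows, larger_than) {
  n <- length(value)
  width <- as.integer(next_rows) + 1L   # current row plus the next rows
  out <- rep(NA_integer_, n)            # plays the role of fill = NA
  if (n < width) return(out)            # no complete window at all

  for (start in seq_len(n - width + 1L)) {
    x <- start:(start + width - 1L)     # row indices covered by this window
    # position of the first value in the window that clears the threshold
    hit <- which(value[x] >= value[x[1]] + larger_than)[1]
    out[start] <- x[hit]                # stays NA when nothing matched
  }
  out
}

value <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)  # data from the question
emulate_left_window(value, next_rows = 3, larger_than = 3)
#[1]  4  4 NA  7  7  7 NA NA NA
```

Only complete windows are evaluated in this sketch, so the last next_rows positions come back as NA even when a later row would qualify. For c(0, 0, 0, 0, 5) it returns NA 5 NA NA NA, although row 5 lies within the next 3 rows of rows 3 and 4.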

Answer 2:

Here is another solution, using lapply.

find_match_index <- function(x, larger_than, within){

    ii <- seq_along(x)  

    unlist(lapply(ii, 
                  function(i, v, n, w) {

                    # here you find all positions that respect your condition
                    res <- which(v[i]+n<=v)  

                    # here you get only the positions in your range of interest
                    res <- res[res>i & res <= i+w]

                    # return only one
                    res[1]
                                    
                 }, 
                 v = x,
                 n = larger_than,
                 w = within))
}

df$index_of_matched_cell <- find_match_index(df$value, larger_than = 3, within = 3)

df
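One thing to be aware of: which(v[i]+n<=v) scans the whole vector for every i, so the work grows quadratically with the length of x. A possible variant, sketched below, restricts the comparison to the look-ahead window first. The name find_match_index_windowed is hypothetical, and the sketch assumes within is at least 1.

```r
# Hypothetical variant of find_match_index: compare only the rows inside
# the look-ahead window instead of the whole vector.
find_match_index_windowed <- function(x, larger_than, within) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    if (i == n) return(NA_integer_)          # nothing after the last row
    window <- (i + 1L):min(i + within, n)    # next rows, clipped at the end
    hit <- which(x[window] >= x[i] + larger_than)[1]
    as.integer(window[hit])                  # NA when no row qualifies
  }, integer(1))
}

df <- data.frame(value = c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L))
df$index_of_matched_cell <- find_match_index_windowed(df$value, larger_than = 3, within = 3)
df$index_of_matched_cell
#[1]  4  4 NA  7  7  7 NA NA NA
```

Like the original function, this variant still checks the shorter windows near the end of the vector, so a match in the last few rows is reported rather than padded with NA.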

Answer 3:

A manual loop version: compare the original vector with a 'leading' copy shifted 3, 2, then 1 positions ahead, overwriting the output at each step so that the nearest match ends up in the result:

looplook <- function(x, dst, n) {
    lead <- function(x,n) c(tail(x,-n), rep(NA,n))
    out <- rep(NA, length(x))
    for(i in n:1) {
        sel <- which(lead(x, i) >= (x + dst))
        out[sel] <- sel + i
    }
    out
}

vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)

looplook(vec, dst=3, n=3)
#[1]  4  4 NA  7  7  7 NA NA NA
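
To see why iterating from the farthest offset down to the nearest yields the first match, the sketch below records the output after each pass. It reuses the logic of looplook; the names shift_ahead and passes are introduced here purely for illustration.

```r
vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)
dst <- 3L
n <- 3L

# Same shifting idea as the inner lead() helper in looplook
shift_ahead <- function(x, k) c(tail(x, -k), rep(NA, k))

out <- rep(NA_integer_, length(vec))
passes <- list()                      # snapshot of `out` after each offset
for (i in n:1) {
  sel <- which(shift_ahead(vec, i) >= vec + dst)
  out[sel] <- sel + i                 # nearer offsets overwrite farther ones
  passes[[paste0("offset_", i)]] <- out
}

passes$offset_3
#[1]  4  5 NA  7  8  9 NA NA NA
passes$offset_2
#[1]  4  4 NA  7  7  8 NA NA NA
passes$offset_1
#[1]  4  4 NA  7  7  7 NA NA NA
```

Each pass is a single vectorised comparison over the whole vector, so the loop runs only n times regardless of the vector length.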

Seems relatively quick when running some benchmarks on a biggish vector of 900K length:

vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)
vec <- rep(vec, 1e5)

system.time(looplook(vec, dst=3, n=3))
#   user  system elapsed 
#  0.031   0.000   0.031 

value <- vec

next_rows <- 3
larger_than <- 3
system.time({
zoo::rollapply(seq_along(value), next_rows + 1, function(x) 
               x[which(value[x] >= (value[x[1]] + larger_than))[1]],
               align = 'left', fill = NA)
})
#   user  system elapsed 
#  5.492   0.028   5.519 

system.time(find_match_index(vec, larger_than = 3, within = 3))
# interrupted manually (C-c C-c)
#Timing stopped at: 39.08 0 39.08

Source: https://stackoverflow.com/questions/63297608/r-dplyr-window-function-get-the-first-value-in-the-next-x-window-that-fulfil-so

Tags

dplyr

tidyverse