Question
I have a data frame (used with dplyr) and a condition. For each row, I want the index of the first row among the next x rows that satisfies the condition.
In my case, I want an additional column holding the index of the first value that is larger than the current value by at least z.
Example: here we look for the index of the first value in the next 3 rows that is larger than the current value by at least 3. For the first row, the value is 0, and the first value in the next 3 rows that is larger by at least 3 is in row 4, whose value is 3.
In the third row, the value is 2, and none of the next 3 rows matches the condition, so we get NA.
  value index_of_matched_cell
1     0                     4
2     0                     4
3     2                    NA
4     3                     7
5     3                     7
6     3                     7
7     6                    NA
8     6                    NA
9     6                    NA
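For reproducibility, the example data can be built as in the sketch below. The name df matches what the answers use; expected is a hypothetical name introduced here to hold the desired column copied from the table.

```r
# Example data from the table above; `df` is the name used in the answers.
df <- data.frame(value = c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L))

# Desired result copied from the table (NA = no match in the next 3 rows).
# `expected` is a hypothetical name, used only for comparison.
expected <- c(4L, 4L, NA, 7L, 7L, 7L, NA, NA, NA)
```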
Thank you!
Answer 1:
Here is one way using rollapply from zoo:
next_rows <- 3
larger_than <- 3
with(df, zoo::rollapply(seq_along(value), next_rows + 1, function(x)
    x[which(value[x] >= (value[x[1]] + larger_than))[1]],
    align = 'left', fill = NA))
#[1]  4  4 NA  7  7  7 NA NA NA
In rollapply we iterate over the index of each row with a window size of next_rows + 1 (since we want to consider the next 3 rows, and rollapply also counts the current row). We compare the current value with the next 3 values and return the index of the first one that is greater than or equal to the current value plus larger_than. Note that the last next_rows rows never get a full window, so they are filled with NA.
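To make the window logic concrete, the function passed to rollapply can be traced by hand for a single window in base R. This is only an illustrative sketch: the helper name first_hit is hypothetical and not part of zoo, and value is the example vector from the question.

```r
value <- c(0, 0, 2, 3, 3, 3, 6, 6, 6)
larger_than <- 3

# Hypothetical helper: the same body that is passed to rollapply above,
# applied to one window of row indices `x` (x[1] is the current row).
first_hit <- function(x) {
  x[which(value[x] >= value[x[1]] + larger_than)[1]]
}

first_hit(4:7)  # window starting at row 4: values 3, 3, 3, 6 -> row 7
first_hit(3:6)  # window starting at row 3: values 2, 3, 3, 3 -> NA
```

When nothing in the window qualifies, which() returns an empty vector, its first element is NA, and indexing x with NA yields NA.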
Answer 2:
Here is another solution using lapply.
find_match_index <- function(x, larger_than, within){
  ii <- seq_along(x)
  unlist(lapply(ii,
                function(i, v, n, w) {
                  # here you find all positions that respect your condition
                  res <- which(v[i]+n<=v)
                  # here you get only the positions in your range of interest
                  res <- res[res>i & res <= i+w]
                  # return only one
                  res[1]
                },
                v = x,
                n = larger_than,
                w = within))
}
df$index_of_matched_cell <- find_match_index(df$value, larger_than = 3, within = 3)
df
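The function above scans the whole vector for every element (which(v[i]+n<=v)), so its cost grows quadratically with the vector length. A possible variant, sketched here under the assumption that within is at least 1, inspects only the rows inside the window. The name find_match_index_windowed is hypothetical and not part of the original answer.

```r
# Hypothetical variant: only look at rows i+1 .. i+within for each row i.
# Assumes within >= 1.
find_match_index_windowed <- function(x, larger_than, within) {
  n <- length(x)
  vapply(seq_along(x), function(i) {
    if (i == n) return(NA_integer_)       # last row has nothing ahead of it
    idx <- (i + 1L):min(i + within, n)    # window, clipped at the end
    hit <- which(x[idx] >= x[i] + larger_than)
    if (length(hit)) idx[hit[1L]] else NA_integer_
  }, integer(1))
}

vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)
find_match_index_windowed(vec, larger_than = 3, within = 3)
```

Unlike a full-width rolling window, this variant still checks the shortened windows near the end of the vector.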
Answer 3:
A manual loop version: compare the original vector with 'lead' copies shifted 3, 2, and 1 positions ahead, overwriting the output at each step so that the nearest match wins:
looplook <- function(x, dst, n) {
  lead <- function(x,n) c(tail(x,-n), rep(NA,n))
  out <- rep(NA, length(x))
  for(i in n:1) {
    sel <- which(lead(x, i) >= (x + dst))
    out[sel] <- sel + i
  }
  out
}
vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)
looplook(vec, dst=3, n=3)
#[1]  4  4 NA  7  7  7 NA NA NA
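The reason the loop runs from the farthest offset down to 1 is that later iterations overwrite earlier ones, so the closest match ends up in the output. The trace below is a minimal sketch on a made-up three-element vector; shift_ahead is a hypothetical stand-in for the local lead helper above.

```r
# Hypothetical stand-in for the local `lead` helper in looplook.
shift_ahead <- function(x, k) c(tail(x, -k), rep(NA, k))

x   <- c(0, 3, 5)   # made-up data: rows 2 and 3 both qualify for row 1
dst <- 3
out <- rep(NA_integer_, length(x))

for (k in 2:1) {
  # rows whose value k positions ahead is at least dst larger
  sel <- which(shift_ahead(x, k) >= x + dst)
  out[sel] <- sel + k
  print(out)        # offset 2 writes row 3; offset 1 then overwrites with row 2
}
```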
This seems relatively quick in benchmarks on a largish vector of length 900K:
vec <- c(0L, 0L, 2L, 3L, 3L, 3L, 6L, 6L, 6L)
vec <- rep(vec, 1e5)
system.time(looplook(vec, dst=3, n=3))
# user system elapsed
# 0.031 0.000 0.031
value <- vec
next_rows <- 3
larger_than <- 3
system.time({
  zoo::rollapply(seq_along(value), next_rows + 1, function(x)
      x[which(value[x] >= (value[x[1]] + larger_than))[1]],
      align = 'left', fill = NA)
})
# user system elapsed
# 5.492 0.028 5.519
system.time(find_match_index(vec, larger_than = 3, within = 3))
# C-c C-c
#Timing stopped at: 39.08 0 39.08
Source: https://stackoverflow.com/questions/63297608/r-dplyr-window-function-get-the-first-value-in-the-next-x-window-that-fulfil-so