I have two vectors, e and g. For each element in e, I want to know the percentage of elements in g that are smaller. One way to implement this in R is:
set.seed(21)
e <- rnorm(1e4)
g <- rnorm(1e4)
mf <- function(p,v) {100*length(which(v<=p))/length(v)}
mf.out <- sapply(X=e, FUN=mf, v=g)
With large e or g, this takes a long time to run. How can I change this code to make it run faster?
Note: The mf function above is based on code from the mess function in the dismo package.
This is slow because you're calling your function length(e) times, and each call scans all of g. It doesn't make much difference for small vectors, but the overhead from R function calls really starts to add up with larger vectors.
Normally, you would need to move this to compiled code, but luckily you can use findInterval:
set.seed(21)
e <- rnorm(1e4)
g <- rnorm(1e4)
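# findInterval() needs a sorted vector; the result is a proportion (multiply by 100 for a percentage)
O <- findInterval(e,sort(g))/length(g)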
# Now for some timings:
f <- function(p,v) mean(v<=p)
system.time(o <- sapply(e, f, g))
# user system elapsed
# 0.95 0.03 0.98
system.time(O <- findInterval(e,sort(g))/length(g))
# user system elapsed
# 0 0 0
identical(o,O) # may be FALSE
all.equal(o,O) # should be TRUE
# How fast is this on large vectors?
set.seed(21)
e <- rnorm(1e7)
g <- rnorm(1e7)
system.time(O <- findInterval(e,sort(g))/length(g))
# user system elapsed
# 22.08 0.08 22.31
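To see why findInterval gives the same answer as the sapply version, here is a small sketch on toy vectors. The values and the names pct_slow, pct_fast and pct_strict are made up for illustration and are not part of the original code. It also assumes R >= 3.3.0 for the left.open argument, which switches the comparison from <= to a strict <, in case "smaller" is meant strictly.

```r
# Toy data (hypothetical values, chosen so the counts are easy to check by hand)
g <- c(5, 1, 3, 3, 9)
e <- c(0, 3, 4, 10)

# Reference: the original approach, one function call per element of e
pct_slow <- sapply(e, function(p) 100 * mean(g <= p))

# findInterval() requires a sorted vector; for each element of e it returns
# the number of sorted g values that are <= that element
pct_fast <- 100 * findInterval(e, sort(g)) / length(g)

# With left.open = TRUE it counts values strictly less than each element of e
pct_strict <- 100 * findInterval(e, sort(g), left.open = TRUE) / length(g)

print(pct_fast)
print(pct_strict)

# Both approaches should agree up to floating-point tolerance
stopifnot(isTRUE(all.equal(pct_slow, pct_fast)))
```

The two results differ only where an element of e ties with a value in g (here, the 3). If you only need the <= version, ecdf(g)(e) from the stats package should return the same proportions.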
Source: https://stackoverflow.com/questions/12982152/speeding-up-function-that-uses-which-within-a-sapply-call-in-r