I have a fairly big vector (>500,000 in length). It contains a bunch of NA
interspersed with 1
and it is always guaranteed that it begins with 1<
The function fun1 can be speeded up considerably by using the compiler package. Using the code provided by Joshua and extending it with the compiler package:
library(zoo) # for na.locf
library(rbenchmark)
library(compiler)
v1 <- c(1,NA,NA,NA,1,NA,NA,NA,NA,NA,1,NA,NA,1,NA,1,NA,NA,NA,NA,NA,NA,NA,NA,NA,1)
v2 <- c(10,10,10,9,10,9,9,9,9,9,10,10,10,11,8,12,12,12,12,12,12,12,12,12,12,13)
fun1 <- function(v1,v2) {
for (i in 2:length(v1)){
if (!is.na(v1[i-1]) && is.na(v1[i]) && v2[i]==v2[i-1]){
v1[i]<-1
}
}
v1
}
fun2 <- function(v1,v2) {
# create groups in which we need to assess missing values
d <- cumsum(as.logical(c(0,diff(v2))))
# for each group, carry the first obs forward
ave(v1, d, FUN=function(x) na.locf(x, na.rm=FALSE))
}
fun3 <- cmpfun(fun1)
fun1(v1,v2)
fun2(v1,v2)
all.equal(fun1(v1,v2), fun2(v1,v2))
all.equal(fun1(v1,v2), fun3(v1,v2))
Nrep <- 1000
V1 <- rep(v1, each=Nrep)
V2 <- rep(v2, each=Nrep)
all.equal(fun1(V1,V2), fun2(V1,V2))
all.equal(fun1(V1,V2), fun3(V1,V2))
benchmark(fun1(V1,V2), fun2(V1,V2), fun3(V1,V2))
we get the following result
benchmark(fun1(V1,V2), fun2(V1,V2), fun3(V1,V2))
test replications elapsed relative user.self sys.self user.child
1 fun1(V1, V2) 100 12.252 5.706567 12.190 0.045 0
2 fun2(V1, V2) 100 2.147 1.000000 2.133 0.013 0
3 fun3(V1, V2) 100 3.702 1.724266 3.644 0.023 0
So the compiled fun1 is a lot faster than the original fun1 but still slower than fun2.