I\'ve been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was h
For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.
Here's an example below, where I did a microbenchmark::microbenchmark
of your solution as presented in OP, Moody's solution as in his post, and a third solution of mine, with even more vectorization (triple nested lapply
).
set.seed(1976); code = seq(1:60); time = rep(c(0,1,2), each = 20);
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2)); DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2)); DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2)); DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)
library(microbenchmark)
microbenchmark(
`Peter Miksza` = {
outANOVA1 = list()
for (i in names(dat)) {
y = dat[[i]]
outANOVA1[i] = summary(aov(y ~ factor(time) + Error(factor(code)),
data = dat))
}},
Moody_Mudskipper = {
outANOVA2 =
lapply(dat,function(y)
summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
},
`catastrophic_failure` = {
outANOVA3 =
lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
},
times = 1000L)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448 1000 b
# Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811 1000 a
# catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666 68.87781 1000 a
fiddling with JIT compilation, running compiler::setCompilerOptions(optimize = 0)
and compiler::enableJIT(0)
the following result ensues as well
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731 1000 a
# Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284 1000 a
# catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036 1000 a
As alluded by Dirk's comment, there isn't a difference in performance, but readability is greatly impaired using vectorization.
Experimenting with Moody's solutions, it seems growing lists can be a bad idea if the resulting list is moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application though.
You could write it this way, it's more compact:
outANOVA <-
lapply(dat,function(y)
summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
for
loops are not necessarily slower than apply functions but they're less easy to read for many people. It is to some extent a matter of taste.
The real crime is to use a for
loop when a vectorized function is available. These vectorized functions usually contain for loops written in C that are much faster (or call functions that do).
Notice that in this case we also could avoid to create a global variable y
and that we didn't have to initialize the list outANOVA
.
Another point, directly from this relevant post :For loops in R and computational speed (answer by Glen_b):
For loops in R are not always slower than other approaches, like apply - but there's one huge bugbear - •never grow an array inside a loop
Instead, make your arrays full-size before you loop and then fill them up.
In your case you're growing outANOVA
, for big loops it could become problematic.
Here is some microbenchmark
of different methods on a simple example:
n <- 100000
microbenchmark::microbenchmark(
preallocated_vec = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply = sapply(1:n, function(i) i^2),
lapply = lapply(1:n, function(i) i^2),
times=20)
# Unit: milliseconds
# expr min lq mean median uq max neval
# preallocated_vec 9.784237 10.100880 10.686141 10.367717 10.755598 12.839584 20
# preallocated_vec2 9.953877 10.315044 10.979043 10.514266 11.792158 12.789175 20
# incremented_vec 74.511906 79.318298 81.277439 81.640597 83.344403 85.982590 20
# preallocated_list 10.680134 11.197962 12.382082 11.416352 13.528562 18.620355 20
# incremented_list 196.759920 201.418857 212.716685 203.485940 205.441188 393.522857 20
# sapply 6.557739 6.729191 7.244242 7.063643 7.186044 9.098730 20
# lapply 6.019838 6.298750 6.835941 6.571775 6.844650 8.812273 20