Why not use a for loop?

Asked by 野趣味 on 2021-01-18 01:13

I've been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was helpful.

2 Answers
  • 2021-01-18 01:40

    For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.

    Here's an example below, where I benchmarked, via microbenchmark::microbenchmark, your solution as presented in the OP, Moody's solution as in his post, and a third solution of mine with even more vectorization (a triple-nested lapply).

    Microbenchmark

    set.seed(1976)
    # simulated data: 60 observations, 3 time points (20 each), 4 dependent variables
    code = 1:60
    time = rep(c(0, 1, 2), each = 20)
    DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))  # mean shifts at time 2
    DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
    DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))   # mean shifts at time 2
    DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
    dat = data.frame(code, time, DV1, DV2, DV3, DV4)
    
    library(microbenchmark)
    
    microbenchmark(
        `Peter Miksza` = {
            outANOVA1 = list()
            for (i in names(dat)) {
                y = dat[[i]]
                # [[ stores the whole summary object as one list element;
                # single-bracket [ would keep only its first component and warn
                outANOVA1[[i]] = summary(aov(y ~ factor(time) + Error(factor(code)),
                    data = dat))
            }
        },
        Moody_Mudskipper = {
            outANOVA2 =
                lapply(dat, function(y)
                    summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))
        },
        `catastrophic_failure` = {
            outANOVA3 =
                lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))),
                    aov, data = dat), summary)
        },
        times = 1000L)
    

    Results

    #Unit: milliseconds
    #                 expr      min       lq     mean   median       uq       max neval cld
    #         Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448  1000   b
    #     Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811  1000  a 
    # catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666  68.87781  1000  a 
    

    Fiddling with JIT compilation hardly changes the picture. To disable the byte-compiler before re-running the same benchmark (a minimal sketch; both calls are from base R's compiler package):
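
    compiler::setCompilerOptions(optimize = 0)  # turn off bytecode optimization
    compiler::enableJIT(0)                      # disable JIT compilation of R functions
    

    With the compiler disabled, the following results ensue: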

    #Unit: milliseconds
    #                 expr      min       lq     mean   median       uq      max neval cld
    #         Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731  1000   a
    #     Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284  1000   a
    # catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036  1000   a
    

    Conclusion

    As alluded to in Dirk's comment, there isn't a difference in performance, but readability is greatly impaired by the vectorized versions.

    On growing lists

    Experimenting with Moody's solution, it seems growing lists can be a bad idea if the resulting list is moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application, though; both ideas are sketched below.
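
    A minimal sketch, assuming the same dat as above (the helper fit_one is hypothetical, introduced only for illustration):

    # hypothetical helper: fit and summarize one repeated-measures ANOVA
    fit_one <- function(y) summary(aov(y ~ factor(time) + Error(factor(code)), data = dat))
    fit_one_c <- compiler::cmpfun(fit_one)  # byte-compile the function directly
    
    # pre-allocate the result list instead of growing it one element at a time
    outANOVA <- vector(mode = "list", length = ncol(dat))
    names(outANOVA) <- names(dat)
    for (i in names(dat)) outANOVA[[i]] <- fit_one_c(dat[[i]])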

  • 2021-01-18 01:58

    You could write it this way, it's more compact:

    outANOVA <-
      lapply(dat, function(y)
        summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))
    

    for loops are not necessarily slower than apply functions, but many people find them less easy to read. It is to some extent a matter of taste.

    The real crime is to use a for loop when a vectorized function is available. These vectorized functions usually contain for loops written in C that are much faster (or call functions that do).
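
    As a quick illustration, a toy example unrelated to the ANOVA code above:

    n <- 1e6
    x <- numeric(n)
    for (i in 1:n) x[i] <- i^2  # interpreted R loop, one iteration per element
    
    y <- (1:n)^2                # vectorized: the loop runs in compiled C code
    
    identical(x, y)
    #> [1] TRUE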

    Notice that in this case we could also avoid creating a global variable y, and that we didn't have to initialize the list outANOVA.

    Another point, directly from this relevant post: For loops in R and computational speed (answer by Glen_b):

    For loops in R are not always slower than other approaches, like apply, but there's one huge bugbear: never grow an array inside a loop.

    Instead, make your arrays full-size before you loop and then fill them up.

    In your case you're growing outANOVA; for big loops that could become problematic.

    Here is a microbenchmark of different methods on a simple example:

    n <- 100000
    microbenchmark::microbenchmark(
      preallocated_vec  = {x <- vector(length = n); for(i in 1:n) {x[i] <- i^2}},
      preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
      incremented_vec   = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
      preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
      incremented_list  = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
      sapply            = sapply(1:n, function(i) i^2),
      lapply            = lapply(1:n, function(i) i^2),
      times = 20)
    
    # Unit: milliseconds
    # expr                     min         lq       mean     median         uq        max neval
    # preallocated_vec    9.784237  10.100880  10.686141  10.367717  10.755598  12.839584    20
    # preallocated_vec2   9.953877  10.315044  10.979043  10.514266  11.792158  12.789175    20
    # incremented_vec    74.511906  79.318298  81.277439  81.640597  83.344403  85.982590    20
    # preallocated_list  10.680134  11.197962  12.382082  11.416352  13.528562  18.620355    20
    # incremented_list  196.759920 201.418857 212.716685 203.485940 205.441188 393.522857    20
    # sapply              6.557739   6.729191   7.244242   7.063643   7.186044   9.098730    20
    # lapply              6.019838   6.298750   6.835941   6.571775   6.844650   8.812273    20
    