Is R's apply family more than syntactic sugar?

旧时难觅i 2020-11-21 22:14

...regarding execution time and / or memory.

If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must c

5 Answers
  •  遇见更好的自我
    2020-11-21 22:42

    I've written elsewhere that an example like Shane's doesn't really stress the performance difference among the various kinds of looping syntax, because nearly all the time is spent inside the function rather than in the loop itself. Furthermore, that code unfairly compares a for loop that stores nothing with apply family functions that return a value. Here's a slightly different example that emphasizes the point.

    foo <- function(x) {
      x <- x + 1
    }
    y <- numeric(1e6)
    system.time({z <- numeric(1e6); for(i in y) z[i] <- foo(i)})
    #   user  system elapsed 
    #  4.967   0.049   7.293 
    system.time(z <- sapply(y, foo))
    #   user  system elapsed 
    #  5.256   0.134   7.965 
    system.time(z <- lapply(y, foo))
    #   user  system elapsed 
    #  2.179   0.126   3.301 
    

    If you plan to save the result then apply family functions can be much more than syntactic sugar.

    (A simple unlist of z takes only 0.2s, so lapply is still much faster. Initializing z inside the for loop is quite fast, and I'm reporting the average of the last 5 of 6 runs, so moving the initialization outside system.time would hardly affect things.)
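
    One caveat about the loop above: `y` holds only zeros, so `i` is 0 on every iteration and `z[0] <- ...` assigns nothing. Below is a minimal sketch of a like-for-like comparison, assuming the intent was to store one result per element. It indexes with seq_along() instead. The smaller `n` and the vapply() call are additions for illustration, and timings depend on the machine and R version, so none are shown.

    ```r
    foo <- function(x) x + 1

    n <- 1e5                  # smaller than the original 1e6 to keep the run short
    y <- numeric(n)

    # for loop that actually stores each result in a preallocated vector
    t_for <- system.time({
      z_for <- numeric(n)
      for (i in seq_along(y)) z_for[i] <- foo(y[i])
    })

    # apply-family equivalents; vapply() declares the result type up front
    t_sapply <- system.time(z_sapply <- sapply(y, foo))
    t_vapply <- system.time(z_vapply <- vapply(y, foo, numeric(1)))
    t_lapply <- system.time(z_lapply <- unlist(lapply(y, foo)))

    # all four approaches should produce the same values
    stopifnot(identical(z_for, z_sapply),
              identical(z_for, z_vapply),
              identical(z_for, z_lapply))

    # elapsed seconds for each approach, side by side
    rbind(t_for, t_sapply, t_vapply, t_lapply)[, "elapsed"]
    ```

    Running this a few times and discarding the first run, as done for the timings above, gives a steadier picture than a single call.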

    One more thing to note is that there is another reason to use apply family functions, independent of their performance, clarity, or lack of side effects. A for loop tends to encourage putting as much as possible inside the loop, because each loop requires setting up variables to store information (among other possible operations). Apply statements tend to be biased the other way. Often you want to perform multiple operations on your data, some of which can be vectorized and some of which cannot. In R, unlike other languages, it is best to separate those operations: run the ones that are not vectorized in an apply statement (or a vectorized version of the function) and run the ones that are vectorized as true vector operations. This often speeds things up tremendously.
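
    As a minimal sketch of that separation (the data and the transformation here are hypothetical, chosen only for illustration): the per-sample median cannot be computed across list elements in one vectorized call, so it goes through vapply(), while the transformation that follows is applied once to the whole result vector.

    ```r
    set.seed(42)
    # hypothetical data: 1000 samples of varying length
    samples <- lapply(sample(5:50, 1000, replace = TRUE), rnorm)

    # Everything inside the loop: median plus transformation on every iteration
    res_loop <- numeric(length(samples))
    for (i in seq_along(samples)) {
      med <- median(samples[[i]])
      res_loop[i] <- exp(med) / (1 + exp(med))
    }

    # Separated: only the non-vectorized step goes through vapply();
    # the transformation runs once as a true vector operation
    meds <- vapply(samples, median, numeric(1))
    res_vec <- exp(meds) / (1 + exp(meds))

    # both versions should agree
    stopifnot(isTRUE(all.equal(res_loop, res_vec)))
    ```

    With a transformation this cheap the gain is small. The point is the structure: the loop body shrinks to the one step that genuinely needs iteration.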

    Take Joris Meys's example, where he replaces a traditional for loop with a handy R function. We can use it to show that writing code in a more R-friendly manner gives a similar speedup without the specialized function.

    set.seed(1)  # for reproducibility of the results
    
    # The data - copied from Joris Meys's answer
    X <- rnorm(100000)
    Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
    Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))
    
    # an R way to generate tapply functionality that is fast and 
    # shows more general principles about fast R coding
    YZ <- interaction(Y, Z)
    XS <- split(X, YZ)
    m <- vapply(XS, mean, numeric(1))
    m <- matrix(m, nrow = length(levels(Y)))
    rownames(m) <- levels(Y)
    colnames(m) <- levels(Z)
    m
    

    This winds up being much faster than the for loop and just a little slower than the built-in, optimized tapply function. That's not because vapply is so much faster than for, but because it performs only one operation in each iteration; everything else in this code is vectorized. In Joris Meys's traditional for loop, many (7?) operations occur in each iteration, and there's quite a bit of setup just for it to execute. Note also how much more compact this is than the for version.
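
    A quick way to check the split/vapply result against tapply, and to time both, is sketched below. It regenerates the same data so it can be run on its own. Setting the dimnames inside matrix() and using nlevels() are small substitutions for the rownames/colnames calls above, and no timings are shown because they vary by machine.

    ```r
    set.seed(1)
    X <- rnorm(100000)
    Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
    Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))

    # split/vapply version; interaction() varies Y fastest, so filling the
    # matrix column by column puts Y on the rows and Z on the columns
    m <- vapply(split(X, interaction(Y, Z)), mean, numeric(1))
    m <- matrix(m, nrow = nlevels(Y), dimnames = list(levels(Y), levels(Z)))

    # built-in reference
    ref <- tapply(X, list(Y, Z), mean)

    # same cell means, ignoring attribute differences
    stopifnot(isTRUE(all.equal(m, ref, check.attributes = FALSE)))

    t_split  <- system.time(vapply(split(X, interaction(Y, Z)), mean, numeric(1)))
    t_tapply <- system.time(tapply(X, list(Y, Z), mean))
    rbind(t_split, t_tapply)[, "elapsed"]
    ```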
