There are several reasons why one might prefer an apply family function over a for loop, or vice versa.
Firstly, for() loops and apply() or sapply() will generally be just as quick as each other when executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears greatest when the act of "looping" over the data is a significant part of the compute time; in many day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so those functions still need to be interpreted and then run.
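If you want to check that for yourself, here is a rough timing sketch using system.time(); the vector size and the toy function are arbitrary choices, exact timings will vary by machine and R version, and apply() is omitted because it wants a matrix or array rather than a plain vector:
## rough timing sketch; x and f are arbitrary and timings will vary
x <- runif(1e6)
f <- function(xi) xi + 1

system.time({                       # explicit for() loop with pre-allocated output
    out1 <- numeric(length(x))
    for(i in seq_along(x)) out1[i] <- f(x[i])
})
system.time(out2 <- sapply(x, f))   # sapply(): R-level looping plus simplification
system.time(out3 <- lapply(x, f))   # lapply(): the looping is done in compiled code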
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
## pre-allocate storage for the results
IN <- runif(10)
OUT <- logical(length = length(IN))
## loop over the indices, filling OUT one element at a time
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something simple to make a point: you have to manage the output yourself. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk of storage, then check inside the loop whether you have exhausted that storage and, if so, bolt on another big chunk, as sketched below.
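A minimal sketch of that grow-in-chunks pattern, where the chunk size, n_iter, and compute_one() are purely illustrative placeholders:
## grow-in-chunks sketch; chunk, n_iter and compute_one() are placeholders
chunk <- 100
OUT <- numeric(chunk)
n_iter <- 250                       # in real use this is not known in advance
compute_one <- function(i) sqrt(i)  # stand-in for the real per-iteration work
for(i in seq_len(n_iter)) {
    if(i > length(OUT)) {
        OUT <- c(OUT, numeric(chunk))   # storage exhausted: bolt on another chunk
    }
    OUT[i] <- compute_one(i)
}
OUT <- OUT[seq_len(n_iter)]         # trim any unused tail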
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above), we can let R handle that and succinctly ask it to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and that will result in simple, easy-to-understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
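For instance, the toy loop above collapses to a single call; sapply() is shown here just as one of several equivalent options:
OUT <- sapply(IN, function(x) x > 0.5)
## or, since > is vectorised anyway, simply:
OUT <- IN > 0.5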
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family, as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations; a rough sketch of that sort of loop follows.
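As a hedged illustration only (the model, data set, and error measure here are arbitrary choices, not the cross-validation code referred to above), a k-fold loop might fill several output objects per iteration:
## illustrative k-fold CV loop; lm() on mtcars and RMSE are arbitrary choices
set.seed(1)
k <- 5
folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
rmse  <- numeric(k)          # one output object ...
coefs <- vector("list", k)   # ... and another, both filled per iteration
for(i in seq_len(k)) {
    test  <- mtcars[folds == i, ]
    train <- mtcars[folds != i, ]
    fit   <- lm(mpg ~ wt + hp, data = train)   # several operations per
    pred  <- predict(fit, newdata = test)      # iteration, all sharing index i
    rmse[i]    <- sqrt(mean((test$mpg - pred)^2))
    coefs[[i]] <- coef(fit)
}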
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does), then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() call is one of R's ways of calling compiled C code used by R itself. Apart from some manipulation of X and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
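To make that comparison yourself, printing apply() at the console shows the R-level for() loop it is built around (output omitted here):
apply
## or inspect just the function body
body(apply)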