Is R's apply family more than syntactic sugar?

旧时难觅i 2020-11-21 22:14

...regarding execution time and/or memory.

If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must c

5 answers

忘掉有多难 (original poster)

2020-11-21 22:46

Sometimes the speedup can be substantial, for example when you would otherwise have to nest for-loops to compute an average grouped by more than one factor. Here are two approaches that give exactly the same result:

set.seed(1)  # for reproducibility of the results

# The data
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5],100000,replace=TRUE))
Z <- as.factor(sample(letters[1:10],100000,replace=TRUE))

# The function forloop() averages x over every combination of y and z
forloop <- function(x,y,z){
  # Cache the levels and their counts so that levels() and length()
  # don't have to be called more than once.
  ylev <- levels(y)
  zlev <- levels(z)
  n <- length(ylev)
  p <- length(zlev)

  out <- matrix(NA,ncol=p,nrow=n)
  for(i in 1:n){
      for(j in 1:p){
          out[i,j] <- (mean(x[y==ylev[i] & z==zlev[j]]))
      }
  }
  rownames(out) <- ylev
  colnames(out) <- zlev
  return(out)
}

# Used on the generated data
forloop(X,Y,Z)

# The same using tapply
tapply(X,list(Y,Z),mean)
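
One way to check that the two calls agree is to compare them with all.equal(). The following is only a minimal sketch: it repeats the setup so that it runs on its own, and the names loop_result, tapply_result and same_values are made up for illustration.

```r
set.seed(1)  # same seed and data as in the answer
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))

# Nested for-loop version: one subset and one mean per (y, z) combination
forloop <- function(x, y, z) {
  ylev <- levels(y)
  zlev <- levels(z)
  out <- matrix(NA_real_, nrow = length(ylev), ncol = length(zlev),
                dimnames = list(ylev, zlev))
  for (i in seq_along(ylev)) {
    for (j in seq_along(zlev)) {
      out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  out
}

loop_result   <- forloop(X, Y, Z)
tapply_result <- tapply(X, list(Y, Z), mean)

# all.equal() compares values and attributes (dim, dimnames) with a tolerance
same_values <- isTRUE(all.equal(loop_result, tapply_result))
print(dim(tapply_result))
print(same_values)
```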

Both give exactly the same result: a 5 x 10 matrix of averages with named rows and columns. But:

> system.time(forloop(X,Y,Z))
   user  system elapsed 
   0.94    0.02    0.95 

> system.time(tapply(X,list(Y,Z),mean))
   user  system elapsed 
   0.06    0.00    0.06
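
A single system.time() call can be noisy, and the numbers above depend on the machine and R version. As a rough sketch, the comparison can be repeated a few times and summarized by the median elapsed time. The helper median_elapsed() and the choice of 3 repetitions are assumptions made here for illustration, not part of the original answer.

```r
set.seed(1)
X <- rnorm(100000)
Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))

# Compact restatement of the nested-loop approach from the answer
forloop <- function(x, y, z) {
  ylev <- levels(y)
  zlev <- levels(z)
  out <- matrix(NA_real_, nrow = length(ylev), ncol = length(zlev),
                dimnames = list(ylev, zlev))
  for (i in seq_along(ylev)) {
    for (j in seq_along(zlev)) {
      out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  out
}

# Hypothetical helper: run f() `reps` times, return the median elapsed seconds
median_elapsed <- function(f, reps = 3) {
  times <- replicate(reps, system.time(f())[["elapsed"]])
  median(times)
}

loop_time   <- median_elapsed(function() forloop(X, Y, Z))
tapply_time <- median_elapsed(function() tapply(X, list(Y, Z), mean))

# Absolute values will differ from the timings quoted in the answer
print(c(forloop = loop_time, tapply = tapply_time))
```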

There you go. What did I win? ;-)
