Is R's apply family more than syntactic sugar?

旧时难觅i 2020-11-21 22:14

...regarding execution time and / or memory.

If this is not true, prove it with a code snippet. Note that speedup by vectorization does not count. The speedup must c

5 Answers
  •  忘掉有多难
    2020-11-21 22:46

    Sometimes the speedup can be substantial, for example when you would otherwise nest for-loops to compute an average over a grouping by more than one factor. Here are two approaches that give exactly the same result:

    set.seed(1)  # for reproducibility of the results
    
    # The data
    X <- rnorm(100000)
    Y <- as.factor(sample(letters[1:5], 100000, replace = TRUE))
    Z <- as.factor(sample(letters[1:10], 100000, replace = TRUE))
    
    # forloop() averages x over every combination of the levels of y and z
    forloop <- function(x, y, z) {
      # Computed once up front so that levels() and length()
      # are not called again inside the loops.
      ylev <- levels(y)
      zlev <- levels(z)
      n <- length(ylev)
      p <- length(zlev)
    
      out <- matrix(NA, ncol = p, nrow = n)
      for (i in seq_len(n)) {
        for (j in seq_len(p)) {
          # mean of the x values that fall in cell (i, j)
          out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
        }
      }
      rownames(out) <- ylev
      colnames(out) <- zlev
      return(out)
    }
    
    # Used on the generated data
    forloop(X,Y,Z)
    
    # The same using tapply
    tapply(X,list(Y,Z),mean)
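
    One way to confirm that the loop and tapply produce the same matrix is all.equal(). The sketch below is not part of the original answer: it uses a smaller data set to keep the check quick, and loop_means is a hypothetical, condensed stand-in for forloop above.

```r
set.seed(1)  # smaller data than in the answer, only to keep the check quick

x <- rnorm(1000)
y <- factor(sample(letters[1:5], 1000, replace = TRUE))
z <- factor(sample(letters[1:10], 1000, replace = TRUE))

# loop_means (hypothetical name): one mean per (y level, z level) cell,
# computed with nested for-loops like forloop() in the answer.
loop_means <- function(x, y, z) {
  ylev <- levels(y)
  zlev <- levels(z)
  out <- matrix(NA_real_, nrow = length(ylev), ncol = length(zlev),
                dimnames = list(ylev, zlev))
  for (i in seq_along(ylev)) {
    for (j in seq_along(zlev)) {
      out[i, j] <- mean(x[y == ylev[i] & z == zlev[j]])
    }
  }
  out
}

by_loop   <- loop_means(x, y, z)
by_tapply <- tapply(x, list(y, z), mean)

# all.equal() compares the values (within a small numeric tolerance)
# as well as attributes such as dim and dimnames.
same <- isTRUE(all.equal(by_loop, by_tapply))
print(same)
```

    Note that this comparison assumes every combination of levels occurs at least once. For an empty cell the loop would store NaN (the mean of an empty vector), while tapply fills the cell with NA by default, so the two results would differ there.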
    

    Both give exactly the same result: a 5 x 10 matrix of averages with named rows and columns. But:

    > system.time(forloop(X,Y,Z))
       user  system elapsed 
       0.94    0.02    0.95 
    
    > system.time(tapply(X,list(Y,Z),mean))
       user  system elapsed 
       0.06    0.00    0.06 
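
    A single system.time() call can be noisy, so a comparison like the one above may be more convincing when repeated a few times. The following is a minimal sketch under stated assumptions: time_median and loop_means are hypothetical helpers that do not appear in the original answer, and the number of repetitions (5) is an arbitrary choice.

```r
set.seed(1)

x <- rnorm(100000)
y <- factor(sample(letters[1:5], 100000, replace = TRUE))
z <- factor(sample(letters[1:10], 100000, replace = TRUE))

# loop_means (hypothetical name): nested for-loops over the factor levels,
# filling the result matrix cell by cell via its dimnames.
loop_means <- function(x, y, z) {
  out <- matrix(NA_real_, nlevels(y), nlevels(z),
                dimnames = list(levels(y), levels(z)))
  for (a in levels(y)) {
    for (b in levels(z)) {
      out[a, b] <- mean(x[y == a & z == b])
    }
  }
  out
}

# time_median (hypothetical helper): call f() `reps` times and return the
# median elapsed wall-clock time in seconds.
time_median <- function(f, reps = 5) {
  elapsed <- vapply(seq_len(reps),
                    function(i) system.time(f())[["elapsed"]],
                    numeric(1))
  median(elapsed)
}

loop_time   <- time_median(function() loop_means(x, y, z))
tapply_time <- time_median(function() tapply(x, list(y, z), mean))

print(c(loop = loop_time, tapply = tapply_time))
```

    The absolute numbers depend on the machine and the R version, so they will not necessarily match the timings quoted above.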
    

    There you go. What did I win? ;-)
