R: speeding up “group by” operations

后端 未结 5 710
挽巷
挽巷 2020-11-28 19:30

I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr\'s ddply() function which works great for a huge per

相关标签:
5条回答
  • 2020-11-28 19:49

    Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.

    Here's the llply example, but the same should apply to ddply:

      x <- seq_len(20)
      wait <- function(i) Sys.sleep(0.1)
      system.time(llply(x, wait))
      #  user  system elapsed 
      # 0.007   0.005   2.005 
    
      library(doMC)
      registerDoMC(2) 
      system.time(llply(x, wait, .parallel = TRUE))
      #  user  system elapsed 
      # 0.020   0.011   1.038 
    

    Edit:

    Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.

    groups <- unique(myDF[,c("year", "state", "group1", "group2")])
    system.time(
    aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
       df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
       cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))
    }))
    )
    
    aggregateDF <- data.frame()
    system.time(
    for(i in 1:nrow(groups)) {
       df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
       aggregateDF <- rbind(aggregateDF, data.frame(cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))))
    }
    )
    
    0 讨论(0)
  • 2020-11-28 19:55

    I would profile with base R

    g <- with(myDF, paste(year, state, group1, group2))
    x <- with(myDF, c(tapply(weights * myFact, g, sum) / tapply(weights, g, sum)))
    aggregateDF <- myDF[match(names(x), g), c("year", "state", "group1", "group2")]
    aggregateDF$V1 <- x
    

    On my machine it takes 5sec compare to 67sec with original code.

    EDIT Just found another speed up with rowsum function:

    g <- with(myDF, paste(year, state, group1, group2))
    X <- with(myDF, rowsum(data.frame(a=weights*myFact, b=weights), g))
    x <- X$a/X$b
    aggregateDF2 <- myDF[match(rownames(X), g), c("year", "state", "group1", "group2")]
    aggregateDF2$V1 <- x
    

    It takes 3sec!

    0 讨论(0)
  • 2020-11-28 20:05

    Further 2x speedup and more concise code:

    library(data.table)
    dtb <- data.table(myDF, key="year,state,group1,group2")
    system.time( 
      res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)] 
    )
    #   user  system elapsed 
    #  0.950   0.050   1.007 
    

    My first post, so please be nice ;)


    From data.table v1.9.2, setDT function is exported that'll convert data.frame to data.table by reference (in keeping with data.table parlance - all set* functions modify the object by reference). This means, no unnecessary copying, and is therefore fast. You can time it, but it'll be negligent.

    require(data.table)
    system.time({
      setDT(myDF)
      res <- myDF[, weighted.mean(myFact, weights), 
                 by=list(year, state, group1, group2)] 
    })
    #   user  system elapsed 
    #  0.970   0.024   1.015 
    

    This is as opposed to 1.264 seconds with OP's solution above, where data.table(.) is used to create dtb.

    0 讨论(0)
  • 2020-11-28 20:09

    Instead of the normal R data frame, you can use a immutable data frame which returns pointers to the original when you subset and can be much faster:

    idf <- idata.frame(myDF)
    system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
       function(df) wtd.mean(df$myFact, weights=df$weights)))
    
    #    user  system elapsed 
    # 18.032   0.416  19.250 
    

    If I was to write a plyr function customised exactly to this situation, I'd do something like this:

    system.time({
      ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
      data <- as.matrix(myDF[c("myFact", "weights")])
      indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))
    
      fun <- function(rows) {
        weighted.mean(data[rows, 1], data[rows, 2])
      }
      values <- vapply(indices, fun, numeric(1))
    
      labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
        c("year", "state", "group1", "group2")]
      aggregateDF <- cbind(labels, values)
    })
    
    # user  system elapsed 
    # 2.04    0.29    2.33 
    

    It's so much faster because it avoids copying the data, only extracting the subset needed for each computation when it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.

    0 讨论(0)
  • 2020-11-28 20:11

    I usually use an index vector with tapply when the function being applied has multiple vector args:

    system.time(tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')], function(s) weighted.mean(myDF$myFact[s], myDF$weights[s])))
    # user  system elapsed 
    # 1.36    0.08    1.44 
    

    I use a simple wrapper which is equivalent but hides the mess:

    tmapply(list(myDF$myFact, myDF$weights), myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)
    

    Edited to include tmapply for comment below:

    tmapply = function(XS, INDEX, FUN, ..., simplify=T) {
      FUN = match.fun(FUN)
      if (!is.list(XS))
        XS = list(XS)
      tapply(1:length(XS[[1L]]), INDEX, function(s, ...)
        do.call(FUN, c(lapply(XS, `[`, s), list(...))), ..., simplify=simplify)
    }
    
    0 讨论(0)
提交回复
热议问题