R: speeding up “group by” operations

后端未结

关注

 5  710

I have a simulation that has a huge aggregate and combine step right in the middle. I prototyped this process using plyr\'s ddply() function which works great for a huge per

相关标签:

5条回答

野趣味

2020-11-28 19:49

Are you using the latest version of plyr (note: this hasn't made it to all the CRAN mirrors yet)? If so, you could just run this in parallel.

Here's the llply example, but the same should apply to ddply:

  x <- seq_len(20)
  wait <- function(i) Sys.sleep(0.1)
  system.time(llply(x, wait))
  #  user  system elapsed 
  # 0.007   0.005   2.005 

  library(doMC)
  registerDoMC(2) 
  system.time(llply(x, wait, .parallel = TRUE))
  #  user  system elapsed 
  # 0.020   0.011   1.038

Edit:

Well, other looping approaches are worse, so this probably requires either (a) C/C++ code or (b) a more fundamental rethinking of how you're doing it. I didn't even try using by() because that's very slow in my experience.

groups <- unique(myDF[,c("year", "state", "group1", "group2")])
system.time(
aggregateDF <- do.call("rbind", lapply(1:nrow(groups), function(i) {
   df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
   cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))
}))
)

aggregateDF <- data.frame()
system.time(
for(i in 1:nrow(groups)) {
   df.tmp <- myDF[myDF$year==groups[i,"year"] & myDF$state==groups[i,"state"] & myDF$group1==groups[i,"group1"] & myDF$group2==groups[i,"group2"],]
   aggregateDF <- rbind(aggregateDF, data.frame(cbind(groups[i,], wtd.mean(df.tmp$myFact, weights=df.tmp$weights))))
}
)

0 讨论(0)

野的像风

2020-11-28 19:55

I would profile with base R

g <- with(myDF, paste(year, state, group1, group2))
x <- with(myDF, c(tapply(weights * myFact, g, sum) / tapply(weights, g, sum)))
aggregateDF <- myDF[match(names(x), g), c("year", "state", "group1", "group2")]
aggregateDF$V1 <- x

On my machine it takes 5sec compare to 67sec with original code.

EDIT Just found another speed up with rowsum function:

g <- with(myDF, paste(year, state, group1, group2))
X <- with(myDF, rowsum(data.frame(a=weights*myFact, b=weights), g))
x <- X$a/X$b
aggregateDF2 <- myDF[match(rownames(X), g), c("year", "state", "group1", "group2")]
aggregateDF2$V1 <- x

It takes 3sec!

0 讨论(0)

死守一世寂寞

2020-11-28 20:05
Further 2x speedup and more concise code:
```
library(data.table)
dtb <- data.table(myDF, key="year,state,group1,group2")
system.time( 
  res <- dtb[, weighted.mean(myFact, weights), by=list(year, state, group1, group2)] 
)
#   user  system elapsed 
#  0.950   0.050   1.007 
```
My first post, so please be nice ;)

From data.table v1.9.2, setDT function is exported that'll convert data.frame to data.table by reference (in keeping with data.table parlance - all set* functions modify the object by reference). This means, no unnecessary copying, and is therefore fast. You can time it, but it'll be negligent.
```
require(data.table)
system.time({
  setDT(myDF)
  res <- myDF[, weighted.mean(myFact, weights), 
             by=list(year, state, group1, group2)] 
})
#   user  system elapsed 
#  0.970   0.024   1.015 
```
This is as opposed to 1.264 seconds with OP's solution above, where data.table(.) is used to create dtb.
0 讨论(0)
发布评论:

提交评论
- 加载中...

无人及你

2020-11-28 20:09

Instead of the normal R data frame, you can use a immutable data frame which returns pointers to the original when you subset and can be much faster:

idf <- idata.frame(myDF)
system.time(aggregateDF <- ddply(idf, c("year", "state", "group1", "group2"),
   function(df) wtd.mean(df$myFact, weights=df$weights)))

#    user  system elapsed 
# 18.032   0.416  19.250

If I was to write a plyr function customised exactly to this situation, I'd do something like this:

system.time({
  ids <- id(myDF[c("year", "state", "group1", "group2")], drop = TRUE)
  data <- as.matrix(myDF[c("myFact", "weights")])
  indices <- plyr:::split_indices(seq_len(nrow(data)), ids, n = attr(ids, "n"))

  fun <- function(rows) {
    weighted.mean(data[rows, 1], data[rows, 2])
  }
  values <- vapply(indices, fun, numeric(1))

  labels <- myDF[match(seq_len(attr(ids, "n")), ids), 
    c("year", "state", "group1", "group2")]
  aggregateDF <- cbind(labels, values)
})

# user  system elapsed 
# 2.04    0.29    2.33

It's so much faster because it avoids copying the data, only extracting the subset needed for each computation when it's computed. Switching the data to matrix form gives another speed boost because matrix subsetting is much faster than data frame subsetting.

0 讨论(0)

滥情空心

2020-11-28 20:11

I usually use an index vector with tapply when the function being applied has multiple vector args:

system.time(tapply(1:nrow(myDF), myDF[c('year', 'state', 'group1', 'group2')], function(s) weighted.mean(myDF$myFact[s], myDF$weights[s])))
# user  system elapsed 
# 1.36    0.08    1.44

I use a simple wrapper which is equivalent but hides the mess:

tmapply(list(myDF$myFact, myDF$weights), myDF[c('year', 'state', 'group1', 'group2')], weighted.mean)

Edited to include tmapply for comment below:

tmapply = function(XS, INDEX, FUN, ..., simplify=T) {
  FUN = match.fun(FUN)
  if (!is.list(XS))
    XS = list(XS)
  tapply(1:length(XS[[1L]]), INDEX, function(s, ...)
    do.call(FUN, c(lapply(XS, `[`, s), list(...))), ..., simplify=simplify)
}

0 讨论(0)