Question
I was trying to run this operation on a big table, to count rows for each combination of a and b in a data frame X:
Y <- aggregate(c ~ a+b,X,length)
And it was taking forever (I stopped it after 30 minutes), though RAM usage held steady.
Then I tried to loop manually through the values of b and aggregate only on a (technically still aggregating on b, but with a single value of b each time):
sub_agg <- list()
unique_bs <- unique(X$b)
for (b_it in unique_bs) {
  sub_agg[[length(sub_agg) + 1]] <- aggregate(c ~ a + b, subset(X, b == b_it), length)
}
Y <- do.call(rbind, sub_agg)
And I was done in 3 min.
I may as well go further and get rid of aggregate completely and only do operations on subsets.
Is aggregate less efficient than nested loops and operations on subsets, or is this a special case?
Aggregations are often the parts of my code that take the most time, so I'm now thinking of always trying loops instead. I'd like to understand better what's happening here.
Additional info:
- X has 20 million rows
- 50 distinct values for b
- 15,000 distinct values for a
Answer 1:
Yes, aggregate is less efficient than the loops you use there, because:
- aggregate becomes disproportionately slower as the number of data points increases. Your second solution calls aggregate on small subsets. One of the reasons is that aggregate depends on sorting, and sorting is not done in O(n) time (see the timing sketch after this list).
- aggregate also uses expand.grid internally, which creates a data frame with all possible combinations of the unique values in the variables a and b. You can see this in the internal code of aggregate.data.frame. This function, too, becomes disproportionately slower as the number of observations rises.
- edit: my last point didn't really make sense, as you do combine everything in a data frame.
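To see that scaling in practice, here is a minimal timing sketch (my own illustration, not part of the original answer) that runs aggregate at a few input sizes:
# Hypothetical sketch: time aggregate() at growing row counts; the
# elapsed time grows faster than the row count does.
for (n in c(1e4, 1e5, 1e6)) {
  d <- data.frame(a = sample(15e2, n, replace = TRUE),
                  b = sample(50, n, replace = TRUE),
                  c = 1)
  print(system.time(aggregate(c ~ a + b, d, length))["elapsed"])
}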
That said, there is absolutely no reason to use aggregate here. I come to the data frame Y by simply using table:
thecounts <- with(X, table(a, b))
Y <- as.data.frame(thecounts)
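One difference worth noting (my addition, not part of the original answer): table() keeps every combination of factor levels, including combinations that never occur, so as.data.frame(thecounts) contains zero-count rows that the aggregate formula would drop. If the two outputs need to match, the zero rows can be filtered out:
# drop the combinations of a and b that never occur, mirroring aggregate's output
Y <- subset(as.data.frame(thecounts), Freq > 0)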
This solution is a whole lot faster than the solution you came up with using aggregate. 68 times faster on my machine, to be precise...
Benchmark:
        test replications elapsed relative
1  aggloop()            1   15.03   68.318
2 tableway()            1    0.22    1.000
Code for benchmarking (note that I made everything a bit smaller so as not to block my R session for too long):
nrows <- 20e5
X <- data.frame(
  a = factor(sample(seq_len(15e2), nrows, replace = TRUE)),
  b = factor(sample(seq_len(50), nrows, replace = TRUE)),
  c = 1
)
aggloop <- function() {
  sub_agg <- list()
  unique_bs <- unique(X$b)
  for (b_it in unique_bs) {
    sub_agg[[length(sub_agg) + 1]] <- aggregate(c ~ a + b, subset(X, b == b_it), length)
  }
  Y <- do.call(rbind, sub_agg)
}
tableway <- function() {
  thecounts <- with(X, table(a, b))
  Y <- as.data.frame(thecounts)
}
library(rbenchmark)
benchmark(aggloop(),
tableway(),
replications = 1
)
Answer 2:
As suggested by @JorisMeys, and to illustrate my comment(s), another way of achieving what you're after is to use data.table, which is very efficient at manipulating large data.
The general form of data.table syntax is, DT being a data.table: DT[i, j, by], meaning "take DT, subset rows using i, then calculate j, grouped by by".
For example, the code to get the counts for each combination of a and b levels in X is: X[, .N, by = c("a", "b")].
You can read more about data.table in the package's introductory vignette.
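As a toy illustration of the DT[i, j, by] pattern (my own example, not from the original answer):
library(data.table)
# subset rows with i (a == 2), compute j (the sum of c), grouped by b
DT <- data.table(a = c(1, 1, 2, 2), b = c("x", "y", "x", "x"), c = 1:4)
DT[a == 2, .(total = sum(c)), by = b]
# returns one row per remaining b group: b = "x", total = 7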
If we want to benchmark the data.table way against the other approaches, using the same example data X and the functions defined in JorisMeys' answer:
library(data.table)
X2 <- copy(X)  # take a copy of X so the conversion to data.table does not affect the original data
dtway <- function() {
  setDT(X2)[, .N, by = c("a", "b")]  # setDT converts X2 to a data.table by reference
}
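A small variant (my addition, not part of the original answer): keyby instead of by also sorts the result by the grouping columns, which can be handy when comparing against aggregate's ordered output:
setDT(X2)[, .N, keyby = c("a", "b")]  # same counts, sorted by a then b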
library(rbenchmark)
benchmark(aggloop(),
tableway(),
dtway(),
replications = 1)
# test replications elapsed relative
# 1 aggloop() 1 17.29 192.111
# 3 dtway() 1 0.09 1.000
# 2 tableway() 1 0.27 3.000
Note: the efficiencies depend on the data; I tried several X (with different random seeds) and found relative efficiencies from 1/2.5 to 1/3.5 for data.table relative to base R with table.
Source: https://stackoverflow.com/questions/42629386/aggregate-less-efficient-than-loops