I have a data frame with 900,000 rows and 11 columns in R. The column names are as follows:
date / mcode / mname / ycode / yname / yissue
If your data is large and speed matters, I would recommend the base R function rowsum, which is a lot faster. I applied the three methods suggested in the other answers (f1 = aggregate, f2 = ddply, f3 = tapply), compared them with f4 = rowsum, and here is what I found:
  test replications elapsed relative
4 f4()          100   0.033     1.00
3 f3()          100   0.046     1.39
1 f1()          100   0.165     5.00
2 f2()          100   0.605    18.33
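
Applied to a data frame like the one in the question, the call would look something like this (a minimal sketch: I am assuming the data frame is named df, that mcode is the grouping column, and that yissue is a numeric column to be summed; swap in whichever columns fit your data):

# Hypothetical column choices: sum yissue within each mcode group
totals <- rowsum(df$yissue, group = df$mcode)
head(totals)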
I have added my benchmark code below in case anyone wants to explore it in more detail.
library(plyr)
library(rbenchmark)

# Toy data: 50 random values in 5 groups (a-e) of 10 each
val <- rnorm(50)
name <- rep(letters[1:5], each = 10)
data <- data.frame(val, name)

# The three approaches from the other answers, plus rowsum
f1 <- function() aggregate(data$val, by = list(data$name), FUN = sum)
f2 <- function() ddply(data, .(name), summarise, sum = sum(val))
f3 <- function() tapply(data$val, data$name, sum)
f4 <- function() rowsum(x = data$val, group = data$name)

benchmark(f1(), f2(), f3(), f4(),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 100)