I have a data frame (df) that has about 40 columns, and I want to aggregate using a sum on 4 of the columns. Outside of the 4 I want to sum, each unique value in column 1 co
This would be the current answer with dplyr:
library('dplyr')
mytb<-read.table(text="
A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5", header = TRUE, stringsAsFactors = FALSE)
mytb %>%
group_by_at(names(select(mytb, -"Sum") ) ) %>%
summarise_all(.funs=sum)
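For what it's worth, on dplyr 1.0.0 or later the same grouping can be written with across(), since the scoped verbs group_by_at() and summarise_all() are superseded there. A minimal sketch on the same example data; the `.groups = "drop"` argument is my own addition so that the result comes back ungrouped:

```r
library(dplyr)

# Same example data as above; the first (unnamed) column becomes row names
mytb <- read.table(text = "
A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5", header = TRUE, stringsAsFactors = FALSE)

# Group by every column except Sum, then total Sum within each group
res <- mytb %>%
  group_by(across(-Sum)) %>%
  summarise(Sum = sum(Sum), .groups = "drop")

print(res)
```

With more than one column to sum, `summarise(across(c(col1, col2), sum))` should follow the same pattern (col1 and col2 being placeholders for your own column names).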
Using the example data mentioned by @josilber, here is another way to get the desired output with dplyr, which tends to be more efficient on large datasets:
library('dplyr')
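# regroup() has been removed from dplyr, and summarise_each()/funs() are deprecated;
# group_by(across()) with summarise() is the equivalent in dplyr >= 1.0.0
out = agg %>%
group_by(across(-sum)) %>%
summarise(sum = sum(sum))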
# Source: local data frame [27 x 3]
# Groups: Species
# Species Petal.Width sum
#1 setosa 0.1 47.8
#2 setosa 0.2 284.1
#3 setosa 0.3 68.1
#4 setosa 0.4 74.6
#5 setosa 0.5 10.1
#6 setosa 0.6 10.1
#7 versicolor 1.0 79.9
#8 versicolor 1.1 34.3
#9 versicolor 1.2 63.8
#10 versicolor 1.3 166.5
#.. ... ... ...
Using data.table:
library('data.table')
out = setDT(agg)[, list(sum = sum(sum)), by= names(agg[,!"sum", with=FALSE])]
# Species Petal.Width sum
#1: setosa 0.2 284.1
#2: setosa 0.4 74.6
#3: setosa 0.3 68.1
#4: setosa 0.1 47.8
#5: setosa 0.5 10.1
#6: setosa 0.6 10.1
#7: versicolor 1.4 96.7
#8: versicolor 1.5 136.5
#9: versicolor 1.3 166.5
#10:versicolor 1.6 42.0
# ...
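If the columns should be summed separately rather than collapsed into a single total first, a sketch along these lines may work in data.table. It starts from the raw iris data used in the other answer; sum_cols and group_cols are names made up for this illustration:

```r
library(data.table)

dt <- as.data.table(iris)

# Columns to total, and everything else as grouping columns
sum_cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
group_cols <- setdiff(names(dt), sum_cols)

# .SD holds only the .SDcols columns within each group
out <- dt[, lapply(.SD, sum), by = group_cols, .SDcols = sum_cols]

print(head(out))
```

Here each of the three columns keeps its own total per Petal.Width/Species combination, instead of one combined sum column.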
Use the data.frame method (aggregate.data.frame) like this:
aggregate(df["field"], by = df[1:36], FUN = sum)
or use the formula method (aggregate.formula) like this:
nms <- c("field", names(df)[1:36])
aggregate(field ~ ., df[nms], sum)
In terms of the example data at the end of the question:
Lines <- " A B C D Sum
1 A B C D 1
2 A B C D 2
3 A B C D 3
4 E F 1 R 4
5 E F 1 R 5"
df <- read.table(text = Lines, header = TRUE)
# data.frame method
aggregate(df["Sum"], df[1:4], sum)
# data.frame method - alternative
aggregate(df[5], df[-5], sum)
# formula method
aggregate(Sum ~., df, sum)
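If there are several columns to sum (as in the question), the formula method can take them with cbind() on the left-hand side, and the dot on the right then stands for the remaining columns. A small sketch, with a made-up data frame df2 and hypothetical column names g1, g2, x and y:

```r
# Hypothetical data: g1 and g2 are grouping columns, x and y are summed
df2 <- data.frame(
  g1 = c("A", "A", "A", "E", "E"),
  g2 = c("B", "B", "B", "F", "F"),
  x  = c(1, 2, 3, 4, 5),
  y  = c(10, 20, 30, 40, 50)
)

# Sum x and y separately within each g1/g2 combination
res <- aggregate(cbind(x, y) ~ ., data = df2, FUN = sum)

print(res)
```

Note that the formula method drops rows with NA in any of the columns by default (na.action = na.omit), which may or may not be what you want.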
You are asking how to aggregate the sum of multiple variables, grouped by the remaining variables. I would do this by combining the multiple variables first and then aggregating using the (in my opinion) more convenient formula interface of the aggregate function. For instance, consider aggregating the sum of Sepal.Length, Sepal.Width, and Petal.Length in the iris dataset based on the remaining variables (Petal.Width and Species):
agg <- iris
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
agg$sum <- rowSums(agg[,cols])
agg <- agg[,!names(agg) %in% cols]
aggregate(sum~., data=agg, FUN=sum)
# Petal.Width Species sum
# 1 0.1 setosa 47.8
# 2 0.2 setosa 284.1
# 3 0.3 setosa 68.1
# 4 0.4 setosa 74.6
# 5 0.5 setosa 10.1
# 6 0.6 setosa 10.1
# 7 1.0 versicolor 79.9
# 8 1.1 versicolor 34.3
# 9 1.2 versicolor 63.8
# 10 1.3 versicolor 166.5
# 11 1.4 versicolor 96.7
# 12 1.5 versicolor 136.5
# 13 1.6 versicolor 42.0
# 14 1.7 versicolor 14.7
# 15 1.8 versicolor 13.9
# 16 1.4 virginica 14.3
# 17 1.5 virginica 27.4
# 18 1.6 virginica 16.0
# 19 1.7 virginica 11.9
# 20 1.8 virginica 162.2
# 21 1.9 virginica 71.7
# 22 2.0 virginica 91.3
# 23 2.1 virginica 94.4
# 24 2.2 virginica 48.3
# 25 2.3 virginica 125.6
# 26 2.4 virginica 44.4
# 27 2.5 virginica 48.2