R aggregate by large number of columns


I have a data frame (df) that has about 40 columns, and I want to aggregate using a sum on 4 of the columns. Outside of the 4 I want to sum, each unique value in column 1 co


4 answers

春和景丽

2020-12-20 08:17

This would be the current answer with dplyr:

library(dplyr)
mytb <- read.table(text = "
A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5", header = TRUE, stringsAsFactors = FALSE)

# Group by every column except Sum, then sum the remaining column
mytb %>%
  group_by_at(names(select(mytb, -"Sum"))) %>%
  summarise_all(.funs = sum)
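
The question asks about summing several columns while grouping by all the others, and the snippet above has only one value column. As a minimal sketch, assuming dplyr 1.0.0 or later (where across() is available), the same idea can be written for more than one value column. The Sum2 column and the names sum_cols, group_cols and res are hypothetical, added only for illustration:

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across()

mytb <- read.table(text = "
A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical second value column, not part of the original data
mytb$Sum2 <- mytb$Sum * 10

sum_cols   <- c("Sum", "Sum2")              # columns to total
group_cols <- setdiff(names(mytb), sum_cols)  # everything else defines the groups

res <- mytb %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(across(all_of(sum_cols), sum), .groups = "drop")

print(res)
```

Naming the value columns once and deriving the grouping columns with setdiff() avoids typing out a long list when the data frame has dozens of columns.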


野性不改

2020-12-20 08:19

Using the example data mentioned by @josilber (the agg data frame, which has a sum column next to Species and Petal.Width), here is another way to get the desired output with dplyr, which is more efficient for huge datasets:

library('dplyr')

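# regroup() is not available in current dplyr and summarise_each()/funs() are
# deprecated; across() (dplyr >= 1.0.0) expresses the same grouping.
out <- agg %>%
  group_by(across(-sum)) %>%   # group by every column except sum
  summarise(sum = sum(sum))    # total of sum within each group

# out has 27 rows and 3 columns: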

#  Species Petal.Width   sum
#1      setosa         0.1  47.8
#2      setosa         0.2 284.1
#3      setosa         0.3  68.1
#4      setosa         0.4  74.6
#5      setosa         0.5  10.1
#6      setosa         0.6  10.1
#7  versicolor         1.0  79.9
#8  versicolor         1.1  34.3
#9  versicolor         1.2  63.8
#10 versicolor         1.3 166.5
#..        ...         ...   ...

Using data.table:

library('data.table')

# setDT() converts agg to a data.table by reference; group by every column except sum
out <- setDT(agg)[, list(sum = sum(sum)), by = setdiff(names(agg), "sum")]

#  Species Petal.Width   sum
#1:     setosa         0.2 284.1
#2:     setosa         0.4  74.6
#3:     setosa         0.3  68.1
#4:     setosa         0.1  47.8
#5:     setosa         0.5  10.1
#6:     setosa         0.6  10.1
#7: versicolor         1.4  96.7
#8: versicolor         1.5 136.5
#9: versicolor         1.3 166.5
#10:versicolor         1.6  42.0
# ...
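
Both snippets above rely on a precomputed sum column. As a hedged sketch of how data.table can total several columns separately in one pass, the block below uses .SD and .SDcols on the iris data. The names dt, sum_cols, group_cols and res are illustrative assumptions, not taken from the answer:

```r
library(data.table)

dt <- as.data.table(iris)  # a copy, so iris itself is left unchanged

sum_cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length")  # columns to total
group_cols <- setdiff(names(dt), sum_cols)                     # Petal.Width, Species

# .SD holds only the columns named in .SDcols, so lapply() sums each one per group
res <- dt[, lapply(.SD, sum), by = group_cols, .SDcols = sum_cols]

print(head(res))
```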


野的像风

2020-12-20 08:26

Use the data.frame method (aggregate.data.frame) like this:

aggregate(df["field"], by = df[1:36], FUN = sum)

or use the formula method (aggregate.formula) like this:

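nms <- c("field", names(df)[1:36])
# restrict df to the summed column plus the grouping columns so that "." expands to the groups only
aggregate(field ~ ., df[nms], sum)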

In terms of the example data at the end of the question:

Lines <- " A B C D Sum
1 A B C D   1
2 A B C D   2
3 A B C D   3
4 E F 1 R   4
5 E F 1 R   5"
df <- read.table(text = Lines, header = TRUE)

# data.frame method
aggregate(df["Sum"], df[1:4], sum)

# data.frame method - alternative
aggregate(df[5], df[-5], sum)

# formula method
aggregate(Sum ~., df, sum)


不思量自难忘°

2020-12-20 08:32

You are asking how to aggregate the sum of multiple variables, grouped by the remaining variables. I would do this by combining the multiple variables first and then aggregating using the (in my opinion) more convenient formula interface of the aggregate function. For instance, consider aggregating the sum of Sepal.Length, Sepal.Width, and Petal.Length in the iris dataset based on the remaining variables (Petal.Width and Species):

agg <- iris
cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
agg$sum <- rowSums(agg[,cols])
agg <- agg[,!names(agg) %in% cols]
aggregate(sum~., data=agg, FUN=sum)
#    Petal.Width    Species   sum
# 1          0.1     setosa  47.8
# 2          0.2     setosa 284.1
# 3          0.3     setosa  68.1
# 4          0.4     setosa  74.6
# 5          0.5     setosa  10.1
# 6          0.6     setosa  10.1
# 7          1.0 versicolor  79.9
# 8          1.1 versicolor  34.3
# 9          1.2 versicolor  63.8
# 10         1.3 versicolor 166.5
# 11         1.4 versicolor  96.7
# 12         1.5 versicolor 136.5
# 13         1.6 versicolor  42.0
# 14         1.7 versicolor  14.7
# 15         1.8 versicolor  13.9
# 16         1.4  virginica  14.3
# 17         1.5  virginica  27.4
# 18         1.6  virginica  16.0
# 19         1.7  virginica  11.9
# 20         1.8  virginica 162.2
# 21         1.9  virginica  71.7
# 22         2.0  virginica  91.3
# 23         2.1  virginica  94.4
# 24         2.2  virginica  48.3
# 25         2.3  virginica 125.6
# 26         2.4  virginica  44.4
# 27         2.5  virginica  48.2
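
The approach above collapses the three columns into a single total. If a separate sum per column is wanted instead, one possible sketch in base R uses the data.frame method of aggregate, which accepts several value columns at once. The names sum_cols, group_cols and res are assumptions made for this illustration:

```r
sum_cols   <- c("Sepal.Length", "Sepal.Width", "Petal.Length")  # columns to total
group_cols <- setdiff(names(iris), sum_cols)                   # Petal.Width, Species

# One sum per value column, for each combination of the grouping columns
res <- aggregate(iris[sum_cols], by = iris[group_cols], FUN = sum)

print(head(res))
```

Adding the three summed columns of res row by row gives back the single sum column produced by the rowSums approach.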
