问题
I wanted to sum individual columns by group and my first thought was to use tapply
.
However, I cannot get tapply
to work. Can tapply
be used to sum multiple columns?
If not, why not?
I have searched the internet extensively and found numerous similar questions posted as far back as 2008. However, none of those questions have been answered directly. Instead, the responses invariably suggest using a different function.
Below is an example data set for which I wish to sum apples by state, cherries by state
and plums by state. Below that I have compiled numerous alternatives to tapply
that
do work.
At the bottom I show a simple modification to the tapply
source code that allows
tapply
to perform the desired operation.
Nevertheless, perhaps I am overlooking a simple way to perform the desired operation
with tapply
. I am not looking for alternative functions, although additional alternatives are welcome.
Given the simplicity of my modification to the tapply
source code I wonder why it, or
something similar, has not already been implemented.
Thank you for any advice. If my question is a duplicate I will be happy to post my question as an answer to that other question.
Here is the example data set:
df.1 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 2 10 20 30
AA 3 100 200 300
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)
This does not work:
tapply(df.1, df.1$state, function(x) {colSums(x[,3:5])})
The help pages says:
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
X an atomic object, typically a vector.
I was confused by the phrase typically a vector
which made me wonder whether
a data frame could be used. I have never been clear on what atomic object
means.
Here are several alternatives to tapply
that do work. The first alternative is a work-around that combines tapply
with apply
.
apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), colSums))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))
# apples cherries plums
# AA 111 222 333
# BB -111 -222 -333
aggregate(df.1[,3:5], by=list(df.1$state), sum)
# Group.1 apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
by(df.1[,3:5], df.1$state, colSums)
# df.1$state: AA
# apples cherries plums
# 111 222 333
# ------------------------------------------------------------
# df.1$state: BB
# apples cherries plums
# -111 -222 -333
with(df.1,
aggregate(x = list(apples = apples,
cherries = cherries,
plums = plums),
by = list(state = state),
FUN = function(x) sum(x)))
# state apples cherries plums
# 1 AA 111 222 333
# 2 BB -111 -222 -333
lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )
# $AA
# apples cherries plums
# 111 222 333
#
# $BB
# apples cherries plums
# -111 -222 -333
Here is the source code for tapply
except that I changed the line:
nx <- length(X)
to:
nx <- ifelse(is.vector(X), length(X), dim(X)[1])
This modified version of tapply
performs the desired operation:
my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
FUN <- if (!is.null(FUN)) match.fun(FUN)
if (!is.list(INDEX)) INDEX <- list(INDEX)
nI <- length(INDEX)
if (!nI) stop("'INDEX' is of length zero")
namelist <- vector("list", nI)
names(namelist) <- names(INDEX)
extent <- integer(nI)
nx <- ifelse(is.vector(X), length(X), dim(X)[1]) # replaces nx <- length(X)
one <- 1L
group <- rep.int(one, nx) #- to contain the splitting vector
ngroup <- one
for (i in seq_along(INDEX)) {
index <- as.factor(INDEX[[i]])
if (length(index) != nx)
stop("arguments must have same length")
namelist[[i]] <- levels(index)#- all of them, yes !
extent[i] <- nlevels(index)
group <- group + ngroup * (as.integer(index) - one)
ngroup <- ngroup * nlevels(index)
}
if (is.null(FUN)) return(group)
ans <- lapply(X = split(X, group), FUN = FUN, ...)
index <- as.integer(names(ans))
if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
ansmat <- array(dim = extent, dimnames = namelist)
ans <- unlist(ans, recursive = FALSE)
} else {
ansmat <- array(vector("list", prod(extent)),
dim = extent, dimnames = namelist)
}
if(length(index)) {
names(ans) <- NULL
ansmat[index] <- ans
}
ansmat
}
my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})
# AA BB
# 111 -111
my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})
# $AA
# apples cherries
# 111 222
#
# $BB
# apples cherries
# -111 -222
回答1:
tapply
works on a vector, for a data.frame you can use by
(which is a wrapper for tapply
, take a look at the code):
> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
apples cherries plums
111 222 333
-------------------------------------------------------------------------------------
df.1$state: BB
apples cherries plums
-111 -222 -333
回答2:
You're looking for by
. It uses the INDEX
in the way that you assumed tapply
would, by row.
by(df.1, df.1$state, function(x) colSums(x[,3:5]))
The problem with your use of tapply
is that you were indexing the data.frame
by column. (Because data.frame
is really just a list
of columns.) So, tapply
complained that your index didn't match the length of your data.frame
which is 5.
回答3:
I looked at the source code for by
, as EDi suggested. That code was substantially more complex than my change to the one line in tapply
. I have now found that my.tapply
does not work with the more complex scenario below where apples
and cherries
are summed by state
and county
. If I get my.tapply
to work with this case I can post the code here later:
df.2 <- read.table(text = '
state county apples cherries plums
AA 1 1 2 3
AA 1 1 2 3
AA 2 10 20 30
AA 2 10 20 30
AA 3 100 200 300
AA 3 100 200 300
BB 7 -1 -2 -3
BB 7 -1 -2 -3
BB 8 -10 -20 -30
BB 8 -10 -20 -30
BB 9 -100 -200 -300
BB 9 -100 -200 -300
', header = TRUE, stringsAsFactors = FALSE)
# my function works
tapply(df.2$apples , list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$apples , list(df.2$state, df.2$county), function(x) {sum(x)})
# my function works
tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
# my function does not work
my.tapply(df.2[,3:4], list(df.2$state, df.2$county), function(x) {colSums(x)})
来源:https://stackoverflow.com/questions/17903205/sum-multiple-columns-by-group-with-tapply