I use the following data.frame as an example:
d <- data.frame(x=c(1,NA), y=c(2,3))
I'd like to sum up the values of y by the variable x.
aggregate makes use of tapply, which in turn makes use of factor on its grouping variable.
But look at what happens with NA values in factor:
factor(c(1, 2, NA))
# [1] 1 2 <NA>
# Levels: 1 2
Note the levels: NA is not one of them, so it cannot form a group. You can use addNA to keep NA as a level:
addNA(factor(c(1, 2, NA)))
# [1] 1 2 <NA>
# Levels: 1 2 <NA>
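To see why the extra level matters for grouping, here is a small sketch that calls tapply directly. The vectors vals and grp are made up for illustration and are not part of the original data.

```r
vals <- c(10, 20, 30)   # illustrative values, not from the example data
grp  <- c(1, 2, NA)     # grouping vector with a missing value

# Plain grouping: NA is not a level, so the value 30 is silently left out.
without_na <- tapply(vals, grp, sum)

# With addNA(): NA becomes a third level and gets its own cell.
with_na <- tapply(vals, addNA(factor(grp)), sum)

length(without_na)  # 2
length(with_na)     # 3

# Side effect worth knowing: is.na() no longer flags the NA level,
# because its underlying integer code is not missing.
is.na(addNA(factor(grp)))  # FALSE FALSE FALSE
```

The last line is the reason to apply addNA only where you need it for grouping: once NA is a level, code that relies on is.na() for that column will not find the missing entries.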
Thus, you would probably need to do something like:
aggregate(y ~ addNA(x), d, sum)
# addNA(x) y
# 1 1 2
# 2 <NA> 3
Or something like:
d$x <- addNA(factor(d$x))
str(d)
# 'data.frame': 2 obs. of 2 variables:
# $ x: Factor w/ 2 levels "1",NA: 1 2
# $ y: num 2 3
aggregate(y ~ x, d, sum)
# x y
# 1 1 2
# 2 <NA> 3
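As a quick check of the difference, the following sketch compares the default formula call with the addNA version. It works on a fresh copy of the example data (d2 is just an illustrative name), so it does not depend on whether d$x has already been converted.

```r
# Fresh copy of the example data (d2 is an illustrative name).
d2 <- data.frame(x = c(1, NA), y = c(2, 3))

# Default formula method: na.action = na.omit removes the row with x = NA
# before any grouping happens.
dropped <- aggregate(y ~ x, d2, sum)

# Wrapping x in addNA() turns NA into a factor level, so the row is kept.
kept <- aggregate(y ~ addNA(x), d2, sum)

nrow(dropped)  # 1
nrow(kept)     # 2
```

Because addNA is applied inside the formula here, d2 itself is left unchanged, which may be preferable if other code still expects x to be numeric.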
(Alternatively, switch to something like data.table, which is not only faster than aggregate but also behaves more consistently with NA values. There is no need to think about whether you are using the formula method of aggregate or not.)
library(data.table)
as.data.table(d)[, sum(y), by = x]
# x V1
# 1: 1 2
# 2: NA 3