Does anybody know how to aggregate by NA in R?
If you take the example below
a <- matrix(1,5,2)
a[1:2,2] <- NA
a[3:5,2] <- 2
aggregate(a[,1]
Instead of aggregate(), you may want to consider rowsum(). It is designed for exactly this operation on matrices and is known to be much faster than aggregate(). We can add NA to the factor levels of a[, 2] with addNA(). This ensures that NA shows up as a grouping variable.
rowsum(a[, 1], addNA(a[, 2]))
# [,1]
# 2 3
# <NA> 2
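To see why addNA() matters here, the following sketch compares the levels of a plain factor with those produced by addNA() on the same column. It rebuilds the example matrix from the question; the names g_plain and g_na are only illustrative.

```r
# Rebuild the example matrix from the question
a <- matrix(1, 5, 2)
a[1:2, 2] <- NA
a[3:5, 2] <- 2

g_plain <- factor(a[, 2])  # NA is not a level, so NA rows belong to no group
g_na    <- addNA(a[, 2])   # NA becomes an explicit extra level

levels(g_plain)
levels(g_na)

# Count rows per group; only the addNA() version counts the two NA rows
table(g_plain)
table(g_na)
```

Any grouping function that works on factor levels should then see the NA rows as one more group.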
If you still want to use aggregate(), you can incorporate addNA() as well.
aggregate(a[, 1], list(Group = addNA(a[, 2])), sum)
# Group x
# 1 2 3
# 2 <NA> 2
And one more option, with data.table:
library(data.table)
as.data.table(a)[, .(x = sum(V1)), by = .(Group = V2)]
# Group x
# 1: NA 2
# 2: 2 3
Use summarize() from dplyr:
library(dplyr)
a %>%
as.data.frame %>%
group_by(V2) %>%
summarize(V1_sum = sum(V1))
Using sqldf:
library(sqldf)
a <- as.data.frame(a)
sqldf("SELECT V2 [Group], SUM(V1) x
FROM a
GROUP BY V2")
Output:
Group x
1 NA 2
2 2 3
stats package
A variation of AdamO's proposal:
data.frame(xtabs( V1 ~ V2 , data = a,na.action = na.pass, exclude = NULL))
Output:
V2 Freq
1 2 3
2 <NA> 2
You can also try aggregating by is.na(a[,2]) instead.
aggregate(a[,1], by=list(is.na(a[,2])), sum)
# Group.1 x
# 1 FALSE 3
# 2 TRUE 2
If you want a finer distinction than just NA or not, then you may want to define a new variable that uses a previously unused value to denote NA (a factor would be more elegant, but a numeric vector is the simplest):
b <- a[,2]
b[is.na(b)] <- 999
aggregate(a[,1], by=list(b), sum)
# Group.1 x
# 1 2 3
# 2 999 2
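The factor variant mentioned above could look roughly like this. It is a minimal sketch that rebuilds the example matrix; the label "missing" is an arbitrary choice, not something from the original answer.

```r
# Rebuild the example matrix from the question
a <- matrix(1, 5, 2)
a[1:2, 2] <- NA
a[3:5, 2] <- 2

# Convert to character first so the placeholder label cannot collide
# with a real numeric value
b <- as.character(a[, 2])
b[is.na(b)] <- "missing"   # arbitrary, hypothetical label for the NA group
b <- factor(b)

res <- aggregate(a[, 1], by = list(Group = b), sum)
res
```

Unlike 999, a text label cannot be mistaken for a real value of the grouping variable.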
The addNA solution of Rich doesn't require any substantial change to the aggregate syntax, so I think it's the best solution. I'll point out that another option, which produces output similar to table (and thus can be coerced into a data.frame structure similar to that of aggregate), is xtabs.
xtabs(a[, 1] ~ a[, 2], addNA = TRUE)
Gives:
Group.1 x
1 2 3
2 <NA> 2
Another "trick" I see is assigning a missing code to these data. We all like the NA output of R, but assigning a missing code to a grouping variable is a good coding exercise. We choose it so that it has one more digit than the largest value in the dataset and is of the form -999...99.
codemiss <- function(x) -10^(floor(log(max(abs(x), na.rm=TRUE), base=10))+2)+1
works in general.
Then recode the missing values:
a[, 2][is.na(a[, 2])] <- codemiss(a[, 2])
And:
aggregate(a[, 1], list(a[, 2]), sum)
Gives you:
Group.1 x
1 -99 2
2 2 3
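As a quick check of the "one more digit, all nines" rule described above, here is a small sketch. The helper name sentinel_code is hypothetical, and it assumes x contains at least one non-zero, non-missing value.

```r
# Hypothetical helper: returns -(10^k - 1), where k is one more than the
# number of digits in the largest absolute value of x
sentinel_code <- function(x) {
  digits <- floor(log10(max(abs(x), na.rm = TRUE))) + 1
  -(10^(digits + 1) - 1)
}

sentinel_code(c(2, NA))       # largest value has 1 digit, so the code has 2
sentinel_code(c(150, 7, NA))  # largest value has 3 digits, so the code has 4
```

Because the code is negative and longer than any value in the data, it sorts first and cannot collide with an existing group.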