Aggregate with na.action=na.pass gives unexpected answer

囚心锁ツ 2021-01-05 18:54

I use the following data.frame as an example:

d <- data.frame(x=c(1,NA), y=c(2,3))

I'd like to sum up the values of y by the variable x.

1 answer
  • 2021-01-05 19:19

    aggregate makes use of tapply, which in turn makes use of factor on its grouping variable.

    But look at what happens to NA values in factor:

    factor(c(1, 2, NA))
    # [1] 1    2    <NA>
    # Levels: 1 2
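
    To see the same drop on the data from the question, here is a minimal sketch. The names by_tapply and res_pass are just illustrative; the point is that na.action = na.pass only controls which rows reach the grouping step, and the NA group still disappears there.

    ```r
    d <- data.frame(x = c(1, NA), y = c(2, 3))

    # tapply() turns the grouping vector into a factor, so NA never becomes a level
    by_tapply <- tapply(d$y, d$x, sum)
    names(by_tapply)

    # na.pass keeps the NA row in the model frame, but rows whose
    # grouping value is NA are still left out of the aggregated result
    res_pass <- aggregate(y ~ x, d, sum, na.action = na.pass)
    nrow(res_pass)
    ```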
    

    Note the levels: NA is not among them. You can use addNA to keep NA as a level:

    addNA(factor(c(1, 2, NA)))
    # [1] 1    2    <NA>
    # Levels: 1 2 <NA>
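
    As a rough sketch of why this helps: once NA is a level, the factor's values are no longer treated as missing, so they survive NA filtering. The names f1 and f2 are only for illustration, and factor(..., exclude = NULL) is shown as what I believe is an equivalent spelling rather than something you must use.

    ```r
    f1 <- addNA(factor(c(1, 2, NA)))
    f2 <- factor(c(1, 2, NA), exclude = NULL)

    # Both spellings should end up with the same three levels
    same_levels <- identical(levels(f1), levels(f2))

    # NA is now an ordinary level, so no element is reported as missing
    any_missing <- any(is.na(f1))
    ```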
    

    Thus, you would probably need to do something like:

    aggregate(y ~ addNA(x), d, sum)
    #   addNA(x) y
    # 1        1 2
    # 2     <NA> 3
    

    Or something like:

    d$x <- addNA(factor(d$x))
    str(d)
    # 'data.frame': 2 obs. of  2 variables:
    #  $ x: Factor w/ 2 levels "1",NA: 1 2
    #  $ y: num  2 3
    aggregate(y ~ x, d, sum)
    #      x y
    # 1    1 2
    # 2 <NA> 3
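
    If you are not using the formula interface, the same idea should carry over by passing the addNA factor through by. This is a sketch under that assumption; res_by is just an illustrative name.

    ```r
    d <- data.frame(x = c(1, NA), y = c(2, 3))

    # Non-formula interface: the grouping factor goes in `by`,
    # with NA kept as an explicit level via addNA()
    res_by <- aggregate(d["y"], by = list(x = addNA(d$x)), FUN = sum)
    res_by
    ```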
    

    (Alternatively, switch to something like "data.table", which will not only be faster than aggregate but will also give you more consistent behavior with NA values. You then don't need to care whether you're using the formula method of aggregate or not.)

    library(data.table)
    as.data.table(d)[, sum(y), by = x]
    #     x V1
    # 1:  1  2
    # 2: NA  3
    