ignore NA in dplyr row sum

自闭症患者 2020-11-27 17:27

Is there an elegant way to treat NA as 0 (as na.rm = TRUE does) when computing a row sum in dplyr?

data <- data.frame(a=c(1,2,3,4), b=c(4,NA,5,6), c=c(7,8,9,NA))

data %>% mutate(sum          


        
6 Answers
  • 2020-11-27 17:33

    With dplyr 1.0.0 or later, you can use c_across() with rowwise().

    library(dplyr)
    
    data %>%
      rowwise() %>%
      mutate(sum = sum(c_across(a:c), na.rm = TRUE))
    
    #      a     b     c   sum
    #  <dbl> <dbl> <dbl> <dbl>
    #1     1     4     7    12
    #2     2    NA     8    10
    #3     3     5     9    17
    #4     4     6    NA    10
    
  • 2020-11-27 17:41

    Try this:

    data$sum <- apply(data, 1, sum, na.rm = TRUE)
    

    The resulting data is:

      a  b  c sum
    1 1  4  7  12
    2 2 NA  8  10
    3 3  5  9  17
    4 4  6 NA  10
    
  • 2020-11-27 17:43

    Another option:

    data %>%
      mutate(sum = rowSums(., na.rm = TRUE))
    

    Benchmark

    library(dplyr)
    library(microbenchmark)

    mbm <- microbenchmark(
      steven = data %>% mutate(sum = rowSums(., na.rm = TRUE)),
      lyz    = data %>% rowwise() %>% mutate(sum = sum(a, b, c, na.rm = TRUE)),
      nar    = apply(data, 1, sum, na.rm = TRUE),
      # mutate_each() and funs() are deprecated in current dplyr; kept here as benchmarked
      akrun  = data %>% mutate_each(funs(replace(., which(is.na(.)), 0))) %>% mutate(sum = a + b + c),
      frank  = data %>% mutate(sum = Reduce(function(x, y) x + replace(y, is.na(y), 0), .,
                                            init = rep(0, n()))),
      times = 10)
    

    #Unit: milliseconds
    #   expr         min          lq       mean     median         uq        max neval cld
    # steven    9.493812    9.558736   18.31476   10.10280   22.55230   65.15325    10 a  
    #    lyz 6791.690570 6836.243782 6978.29684 6915.16098 7138.67733 7321.61117    10   c
    #    nar  702.537055  723.256808  799.79996  805.71028  849.43815  909.36413    10  b 
    #  akrun   11.372550   11.388473   28.49560   11.44698   20.21214  155.23165    10 a  
    #  frank   20.206747   20.695986   32.69899   21.12998   25.11939  118.14779    10 a 
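
    One caveat with rowSums(..., na.rm = TRUE): a row in which every value is missing sums to 0 rather than NA. The base R sketch below shows that behaviour and one possible way to keep NA for such rows. The data frame (with an all-NA third row) and the names cols, sum_zero, all_missing and sum_na are made up for illustration.

    ```r
    # Illustrative data: the third row is entirely NA
    data <- data.frame(a = c(1, 2, NA), b = c(4, NA, NA), c = c(7, 8, NA))

    cols <- c("a", "b", "c")

    # With na.rm = TRUE, the all-NA row sums to 0
    sum_zero <- rowSums(data[cols], na.rm = TRUE)

    # Variant: flag rows with no observed value and return NA for them
    all_missing <- rowSums(!is.na(data[cols])) == 0
    sum_na <- ifelse(all_missing, NA, sum_zero)

    print(sum_zero)
    print(sum_na)
    ```

    Whether 0 or NA is the right answer for an all-missing row depends on the analysis, so this is a choice to make deliberately rather than a fix.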
    
  • 2020-11-27 17:45

    Here's a similar approach to Steven's, but it uses dplyr::select() to state explicitly which columns to include or ignore (such as ID variables).

    data %>% 
      mutate(sum = rowSums(dplyr::select(., a, b, c), na.rm = TRUE))
    

    Its performance is comparable on a realistically sized dataset. I'm not sure why, though, since no columns are actually excluded in this skinny example.

    Bigger dataset of 1M rows:

    pick <- function() { sample(c(1:5, NA), 1000000, replace = TRUE) }
    data <- data.frame(a=pick(), b=pick(), c=pick())
    

    Results:

    Unit: milliseconds
         expr         min          lq        mean      median          uq         max neval cld
       steven    22.05847    22.96164    56.84822    28.85411    54.99691   174.58447    10 a  
    wibeasley    25.10274    26.98303    30.66911    29.30630    30.63343    49.46048    10 a  
          lyz 10408.89904 10548.33756 10887.51930 10720.92372 11017.56256 12250.41370    10   c
          nar  1975.35941  2011.36445  2123.81705  2090.43174  2172.80501  2362.13658    10  b 
        akrun    31.27247    35.41943    81.33320    57.93900    63.59119   302.21059    10 a  
        frank    37.48265    38.72270    65.02965    41.62735    44.45775   261.79898    10 a  
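
    The point about ID variables can be sketched as follows. The id column and the name data_id are hypothetical additions to the original example, and the `.` placeholder assumes the magrittr pipe (%>%), not the native |> pipe.

    ```r
    library(dplyr)

    # Hypothetical data: `id` is an identifier that must not enter the sum
    data_id <- data.frame(id = c(101, 102, 103, 104),
                          a = c(1, 2, 3, 4),
                          b = c(4, NA, 5, 6),
                          c = c(7, 8, 9, NA))

    # Drop `id` before summing, so only a, b and c contribute
    result <- data_id %>%
      mutate(sum = rowSums(dplyr::select(., -id), na.rm = TRUE))

    print(result)
    ```

    Without the select() step, rowSums(.) would add the id values into every row total.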
    
  • 2020-11-27 17:48

    You could use this:

    library(dplyr)
    data %>% 
      # rowwise() makes the sum operate on each row separately
      rowwise() %>%
      # a plain sum(..., na.rm = TRUE) then gives the desired result
      mutate(sum = sum(a, b, c, na.rm = TRUE))
    

    Output:

    Source: local data frame [4 x 4]
    Groups: <by row>
    
          a     b     c   sum
      (dbl) (dbl) (dbl) (dbl)
    1     1     4     7    12
    2     2    NA     8    10
    3     3     5     9    17
    4     4     6    NA    10
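
    As the "Groups: <by row>" line in the output indicates, the result stays row-wise, so later verbs keep operating one row at a time. A minimal sketch, assuming that is not wanted downstream, adds ungroup() at the end; the names result and total are illustrative.

    ```r
    library(dplyr)

    data <- data.frame(a = c(1, 2, 3, 4), b = c(4, NA, 5, 6), c = c(7, 8, 9, NA))

    result <- data %>%
      rowwise() %>%
      mutate(sum = sum(a, b, c, na.rm = TRUE)) %>%
      # drop the row-wise grouping so later verbs see whole columns again
      ungroup()

    # On the ungrouped result, summarise() collapses to a single row
    total <- result %>% summarise(total = sum(sum))

    print(result)
    print(total)
    ```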
    
  • 2020-11-27 17:51

    Or we can replace NA with 0 and then use the OP's code:

    data %>%
      mutate_each(funs(replace(., which(is.na(.)), 0))) %>%
      mutate(Sum = a + b + c)
      # or, as @Frank mentioned:
      # mutate(Sum = Reduce(`+`, .))
    

    Based on the benchmarks using @Steven Beaupré's data, it seems to be efficient as well.
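
    Since mutate_each() and funs() are deprecated in current dplyr, the same idea can be sketched with across() and coalesce(), assuming dplyr 1.0.0 or later. This is a swapped-in equivalent, not the code that was benchmarked.

    ```r
    library(dplyr)

    data <- data.frame(a = c(1, 2, 3, 4), b = c(4, NA, 5, 6), c = c(7, 8, 9, NA))

    result <- data %>%
      # replace each NA in a, b and c with 0
      mutate(across(a:c, ~ coalesce(.x, 0))) %>%
      # the plain sum now works because no NA remains
      mutate(Sum = a + b + c)

    print(result)
    ```

    Note that this approach overwrites the NA values in the original columns, unlike the rowSums() variants, which leave them in place.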
