How to omit rows with NA in only two columns in R?

梦谈多话 2021-02-04 10:39

I want to omit rows where NA appears in both of two columns.

I'm familiar with na.omit, is.na, and compl

4 Answers
  • 2021-02-04 11:17
    df[!with(df, is.na(x) & is.na(y)), ]
    #   x y  z
    #1  1 4  8
    #2  2 5  9
    #4  3 6 11
    #5 NA 7 NA
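
    To make the one-liner easier to follow, here is a minimal sketch that builds the logical mask in separate steps. The question's data frame is not shown here, so the sample data below is an assumption, chosen only to be consistent with the printed output (row 3 missing in both x and y); the names both_na and result are also just for illustration.

    ```r
    # Assumed sample data: only row 3 has NA in both x and y
    df <- data.frame(x = c(1, 2, NA, 3, NA),
                     y = c(4, 5, NA, 6, 7),
                     z = c(8, 9, 10, 11, NA))

    # TRUE for rows where x and y are both missing
    both_na <- is.na(df$x) & is.na(df$y)

    # Keep every row that is not flagged; row 5 survives because y is present
    result <- df[!both_na, ]
    print(result)
    ```

    Using & rather than | is what restricts the filter to rows where both columns are missing; with | any row with a missing x or y would be dropped.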
    

    I benchmarked the options on a slightly larger dataset. Here are the results:

    set.seed(237)
    df <- data.frame(x = sample(c(NA, 1:20), 1e6, replace = TRUE),
                     y = sample(c(NA, 1:10), 1e6, replace = TRUE),
                     z = sample(c(NA, 5:15), 1e6, replace = TRUE))
    
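    f1 <- function() df[!with(df, is.na(x) & is.na(y)), ]
    f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
    # f3 is the apply() call from another answer, timed as posted. It keeps rows with
    # more than one NA in any column, so it does not return the same rows as f1/f2.
    f3 <- function() df[apply(df, 1, function(x) sum(is.na(x)) > 1), ]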
    
    library(microbenchmark)
    
    microbenchmark(f1(), f2(), f3(), unit="relative")
    # Unit: relative
    #expr       min        lq    median        uq       max neval
    # f1()  1.000000  1.000000  1.000000  1.000000  1.000000   100
    # f2()  1.044812  1.068189  1.138323  1.129611  0.856396   100
    # f3() 26.205272 25.848441 24.357665 21.799930 22.881378   100
    
  • 2021-02-04 11:19

    Use rowSums with is.na, like this:

    > df[rowSums(is.na(df[c("x", "y")])) != 2, ]
       x y  z
    1  1 4  8
    2  2 5  9
    4  3 6 11
    5 NA 7 NA
    

    Jumping on the benchmarking wagon, and demonstrating what I was referring to about this being a fairly easy-to-generalize solution, consider the following:

    ## Sample data with 10 columns and 1 million rows
    set.seed(123)
    df <- data.frame(replicate(10, sample(c(NA, 1:20), 
                                          1e6, replace = TRUE)))
    

    First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.

    f1 <- function() {
      df[!with(df, is.na(X1) & is.na(X2)), ]
    } 
    f2 <- function() {
      df[rowSums(is.na(df[1:2])) != 2, ]
    } 
    
    library(microbenchmark)
    microbenchmark(f1(), f2(), times = 20)
    # Unit: milliseconds
    #  expr      min       lq   median       uq      max neval
    #  f1() 745.8378 1100.764 1128.047 1199.607 1310.236    20
    #  f2() 784.2132 1101.695 1125.380 1163.675 1303.161    20
    

    Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.

    f1_5 <- function() {
      df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
                 is.na(X4) & is.na(X5)), ]
    } 
    f2_5 <- function() {
      df[rowSums(is.na(df[1:5])) != 5, ]
    } 
    
    microbenchmark(f1_5(), f2_5(), times = 20)
    # Unit: seconds
    #    expr      min       lq   median       uq      max neval
    #  f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772    20
    #  f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915    20
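
    Since the rowSums version only changes in the column selection and the count it compares against, it can be wrapped in a small helper. This is a sketch only: drop_all_na and its arguments are hypothetical names, not from any package, and the sample data is an assumption.

    ```r
    # Hypothetical helper: drop rows in which every column in `cols` is NA.
    # `cols` may be column names or positions, e.g. c("x", "y") or 1:5.
    drop_all_na <- function(data, cols) {
      n_missing <- rowSums(is.na(data[cols]))
      data[n_missing != length(cols), , drop = FALSE]
    }

    # Assumed sample data for illustration
    df <- data.frame(x = c(1, 2, NA, 3, NA),
                     y = c(4, 5, NA, 6, 7),
                     z = c(8, 9, 10, 11, NA))

    by_xy <- drop_all_na(df, c("x", "y"))  # drops row 3 (x and y both NA)
    by_xz <- drop_all_na(df, c("x", "z"))  # drops row 5 (x and z both NA)
    ```

    Comparing against length(cols) avoids hard-coding the 2 or 5 used in the benchmarks above.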
    
  • 2021-02-04 11:20

    A dplyr solution:

    library(dplyr)
    # keep rows where at least one of x, y is not NA
    df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))
    

    This can be modified to take any number of columns using the .vars argument.
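
    The scoped verbs such as filter_at are marked as superseded in current dplyr releases. As a rough equivalent, assuming dplyr 1.0.4 or later (where if_any is available), the same filter might be written as below; the sample data is an assumption for illustration.

    ```r
    library(dplyr)

    # Assumed sample data: only row 3 has NA in both x and y
    df <- data.frame(x = c(1, 2, NA, 3, NA),
                     y = c(4, 5, NA, 6, 7),
                     z = c(8, 9, 10, 11, NA))

    # Keep rows where at least one of x, y is not NA
    result <- df %>% filter(if_any(c(x, y), ~ !is.na(.x)))
    print(result)
    ```

    To cover more columns, extend the selection passed to if_any, for example c(x, y, z) or a tidyselect helper.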

  • 2021-02-04 11:37

    You can use apply to build a row selector that flags rows with more than one NA:

    sel <- apply( df, 1, function(x) sum(is.na(x))>1 )
    

    Then negate it to drop the flagged rows (df[ sel, ] would return only the rows you want to omit):

    df[ !sel, ]
    

    To ignore the z column, just omit it from the apply:

    sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )
    

    If every selected column has to be NA for a row to be flagged, just change the function a little:

    sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
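
    As a quick sanity check, the apply selector can be compared with the rowSums approach on a small example. This is only a sketch on assumed sample data; kept_apply, kept_rowsums and same are illustrative names.

    ```r
    # Assumed sample data: only row 3 has NA in both x and y
    df <- data.frame(x = c(1, 2, NA, 3, NA),
                     y = c(4, 5, NA, 6, 7),
                     z = c(8, 9, 10, 11, NA))

    # Flag rows where x and y are both NA, then negate to keep the rest
    sel <- apply(df[, c("x", "y")], 1, function(row) all(is.na(row)))
    kept_apply <- df[!sel, ]

    # Same filter via rowSums, for comparison
    kept_rowsums <- df[rowSums(is.na(df[c("x", "y")])) != 2, ]

    # Should be TRUE if both approaches keep the same rows
    same <- isTRUE(all.equal(kept_apply, kept_rowsums))
    print(same)
    ```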
    

    The other solutions here are more specific to this particular problem, but apply is worth learning as it solves many other problems. The cost is speed (usual caveats about small datasets and speed testing apply):

    > microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
    Unit: microseconds
                                                  expr     min       lq   median       uq      max neval
                  df[!with(df, is.na(x) & is.na(y)), ]  67.148  71.5150  76.0340  86.0155 1049.576   100
            df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945  498.934   100
     df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150  321.583   100
    