How to omit rows with NA in only two columns in R?

梦谈多话 2021-02-04 10:39
I want to omit rows where NA appears in both of two columns.
I'm familiar with na.omit, is.na, and compl

      
      
        
4 answers

小蘑菇 (OP) 2021-02-04 11:19

Use rowSums with is.na, like this:

> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
   x y  z
1  1 4  8
2  2 5  9
4  3 6 11
5 NA 7 NA
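
The data frame behind that output is not shown here. As a sketch, assume a small `df` chosen to be consistent with the printed result; the values in the dropped row 3 (and the use of numeric columns) are assumptions made only for illustration:

```r
# Hypothetical data, reconstructed to match the output above.
# Row 3 is assumed to have NA in both x and y; its z value is made up.
df <- data.frame(x = c(1, 2, NA, 3, NA),
                 y = c(4, 5, NA, 6, 7),
                 z = c(8, 9, 10, 11, NA))

# is.na() on the two columns gives a logical matrix; rowSums() counts
# the NAs in each row. A count of 2 means both x and y are missing.
na_count <- rowSums(is.na(df[c("x", "y")]))

# Keep every row where at least one of x and y is present.
result <- df[na_count != 2, ]
print(result)
```

Row 5 is kept even though x is NA, because y is present there; only rows where both columns are missing are dropped.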




To join the benchmarking, and to show what I meant about this being an easy solution to generalize, consider the following:

## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20), 
                                      1e6, replace = TRUE)))


First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.

f1 <- function() {
  df[!with(df, is.na(X1) & is.na(X2)), ]
} 
f2 <- function() {
  df[rowSums(is.na(df[1:2])) != 2, ]
} 

library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
#  expr      min       lq   median       uq      max neval
#  f1() 745.8378 1100.764 1128.047 1199.607 1310.236    20
#  f2() 784.2132 1101.695 1125.380 1163.675 1303.161    20
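
Timings aside, it may be worth confirming that the two expressions select the same rows. A minimal check on a small, made-up data frame (the name `small` and its values are illustrative and not part of the benchmark):

```r
# Small illustrative data frame; the column names mirror the X1, X2, ...
# names that data.frame(replicate(...)) produces.
small <- data.frame(X1 = c(1, NA, NA, 4),
                    X2 = c(NA, NA, 2, 5),
                    X3 = c(7, 8, NA, 9))

# Logical-AND approach: drop rows where X1 and X2 are both NA.
by_and <- small[!with(small, is.na(X1) & is.na(X2)), ]

# rowSums approach: drop rows where the NA count over X1 and X2 equals 2.
by_rowsums <- small[rowSums(is.na(small[1:2])) != 2, ]

# Compare the two results; only row 2 has NA in both X1 and X2.
print(all.equal(by_and, by_rowsums))
```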


Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.

f1_5 <- function() {
  df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
             is.na(X4) & is.na(X5)), ]
} 
f2_5 <- function() {
  df[rowSums(is.na(df[1:5])) != 5, ]
} 

microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
#    expr      min       lq   median       uq      max neval
#  f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772    20
#  f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915    20
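
Since only the column selection and the comparison value change between the two-column and five-column cases, the rowSums approach can be wrapped in a small helper. This is only a sketch: the function name `drop_rows_all_na`, its arguments, and the `toy` data are hypothetical and introduced here for illustration.

```r
# Hypothetical helper: drop rows in which every selected column is NA.
# `cols` may be column names or positions.
drop_rows_all_na <- function(data, cols) {
  selected <- data[cols]
  na_count <- rowSums(is.na(selected))
  # Compare against the number of selected columns instead of
  # hard-coding 2 or 5.
  data[na_count != ncol(selected), , drop = FALSE]
}

# Illustrative data (not the benchmark data above).
toy <- data.frame(a = c(1, NA, NA),
                  b = c(NA, NA, 2),
                  c = c(3, NA, NA))

two_cols <- drop_rows_all_na(toy, c("a", "b"))  # considers only a and b
all_cols <- drop_rows_all_na(toy, 1:3)          # considers all three columns
print(two_cols)
print(all_cols)
```

In this toy data, row 2 is the only row that is NA in every selected column, so both calls drop it and keep rows 1 and 3.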

    
             
                                                        
            
            
              
                