How to omit rows with NA in only two columns in R?


I want to omit rows where NA appears in both of two columns.

I'm familiar with na.omit, is.na, and compl


                      
4 answers

囚心锁ツ (2021-02-04 11:17)

df[!with(df, is.na(x) & is.na(y)), ]
#    x y  z
# 1  1 4  8
# 2  2 5  9
# 4  3 6 11
# 5 NA 7 NA
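
The question's data frame is not shown in this thread, so the following is only a minimal sketch on a small data frame reconstructed to match the output above. The z value in row 3 is an assumption, since that row is dropped before printing, and the name both_na is introduced here for illustration.

```r
# Sample data reconstructed from the printed output; the z value in row 3
# is an assumption because that row never appears in the output.
df <- data.frame(x = c(1, 2, NA, 3, NA),
                 y = c(4, 5, NA, 6, 7),
                 z = c(8, 9, 10, 11, NA))

# TRUE only for rows where x and y are both NA
both_na <- is.na(df$x) & is.na(df$y)

# Keep every row where that is not the case
res <- df[!both_na, ]
print(res)
```

Row 5 is kept even though x and z are NA there, because only x and y are tested and y is present.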


I benchmarked this on a slightly bigger dataset. Here are the results:

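set.seed(237)
df <- data.frame(x = sample(c(NA, 1:20), 1e6, replace = TRUE),
                 y = sample(c(NA, 1:10), 1e6, replace = TRUE),
                 z = sample(c(NA, 5:15), 1e6, replace = TRUE))

f1 <- function() df[!with(df, is.na(x) & is.na(y)), ]
f2 <- function() df[rowSums(is.na(df[c("x", "y")])) != 2, ]
# f3 is the apply() version from another answer; as written it keeps the rows
# with more than one NA across all columns, so it is comparable only for timing
f3 <- function() df[apply(df, 1, function(x) sum(is.na(x)) > 1), ]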

library(microbenchmark)

microbenchmark(f1(), f2(), f3(), unit="relative")
# Unit: relative
#expr       min        lq    median        uq       max neval
# f1()  1.000000  1.000000  1.000000  1.000000  1.000000   100
# f2()  1.044812  1.068189  1.138323  1.129611  0.856396   100
# f3() 26.205272 25.848441 24.357665 21.799930 22.881378   100
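
Timings are only comparable when the functions return the same rows, so it can help to check that first. The sketch below does this on a small hypothetical data frame (small, g1, g2 and g3 are illustrative names, not part of the benchmark above); it uses base R only.

```r
# Small hypothetical data set (not the benchmark data)
small <- data.frame(x = c(1, NA, NA, 4),
                    y = c(NA, NA, 3, 4),
                    z = c(NA, 2, NA, NA))

# Same three expressions as above, written to take the data as an argument
g1 <- function(d) d[!with(d, is.na(x) & is.na(y)), ]
g2 <- function(d) d[rowSums(is.na(d[c("x", "y")])) != 2, ]
g3 <- function(d) d[apply(d, 1, function(r) sum(is.na(r)) > 1), ]

# g1 and g2 drop the same rows
same_12 <- identical(rownames(g1(small)), rownames(g2(small)))

# g3 keeps the rows with more than one NA in any column, a different set
rows_3 <- rownames(g3(small))
```

Here g1 and g2 both keep rows 1, 3 and 4, while g3 keeps rows 1, 2 and 3.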

                                                                        
                                                        
            
            
              
                
小蘑菇 (2021-02-04 11:19)

Use rowSums with is.na, like this:

> df[rowSums(is.na(df[c("x", "y")])) != 2, ]
   x y  z
1  1 4  8
2  2 5  9
4  3 6 11
5 NA 7 NA
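
To make the intermediate steps visible, here is a minimal sketch that splits the one-liner into parts. The data frame is an assumption reconstructed from the output above (row 3's z value is a guess), and na_matrix, na_count and keep are illustrative names.

```r
# Sample data reconstructed from the printed output (row 3's z is assumed)
df <- data.frame(x = c(1, 2, NA, 3, NA),
                 y = c(4, 5, NA, 6, 7),
                 z = c(8, 9, 10, 11, NA))

# Logical matrix: TRUE where x or y is NA
na_matrix <- is.na(df[c("x", "y")])

# TRUE counts as 1, so this is the number of NAs per row within x and y
na_count <- rowSums(na_matrix)

# Keep the rows where not both columns are NA
keep <- na_count != 2
res <- df[keep, ]
```

Because the test is on a count, the same pattern extends to more columns by changing the column selection and the number compared against.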




Jumping on the benchmarking bandwagon, and to demonstrate what I meant about this being a fairly easy-to-generalize solution, consider the following:

## Sample data with 10 columns and 1 million rows
set.seed(123)
df <- data.frame(replicate(10, sample(c(NA, 1:20), 
                                      1e6, replace = TRUE)))


First, here's what things look like if you're just interested in two columns. Both solutions are pretty legible and short. Speed is quite close.

f1 <- function() {
  df[!with(df, is.na(X1) & is.na(X2)), ]
} 
f2 <- function() {
  df[rowSums(is.na(df[1:2])) != 2, ]
} 

library(microbenchmark)
microbenchmark(f1(), f2(), times = 20)
# Unit: milliseconds
#  expr      min       lq   median       uq      max neval
#  f1() 745.8378 1100.764 1128.047 1199.607 1310.236    20
#  f2() 784.2132 1101.695 1125.380 1163.675 1303.161    20


Next, let's look at the same problem, but this time, we are considering NA values across the first 5 columns. At this point, the rowSums approach is slightly faster and the syntax does not change much.

f1_5 <- function() {
  df[!with(df, is.na(X1) & is.na(X2) & is.na(X3) &
             is.na(X4) & is.na(X5)), ]
} 
f2_5 <- function() {
  df[rowSums(is.na(df[1:5])) != 5, ]
} 

microbenchmark(f1_5(), f2_5(), times = 20)
# Unit: seconds
#    expr      min       lq   median       uq      max neval
#  f1_5() 1.275032 1.294777 1.325957 1.368315 1.572772    20
#  f2_5() 1.088564 1.169976 1.193282 1.225772 1.275915    20
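
The rowSums pattern can be wrapped so that the column count is never hard-coded. This is only a sketch: drop_rows_all_na is a hypothetical helper name, not something from the answer or from any package, and the demo data is a smaller version of the sample data above.

```r
# Hypothetical helper: drop rows in which every column listed in `cols` is NA.
# `cols` can be column names or positions.
drop_rows_all_na <- function(data, cols) {
  n_na <- rowSums(is.na(data[cols]))          # NAs per row within `cols`
  data[n_na != length(cols), , drop = FALSE]  # keep rows with at least one value
}

# Smaller version of the sample data: 10 columns, 1000 rows
set.seed(123)
demo <- data.frame(replicate(10, sample(c(NA, 1:20), 1000, replace = TRUE)))

out_2 <- drop_rows_all_na(demo, c("X1", "X2"))
out_5 <- drop_rows_all_na(demo, 1:5)
```

Comparing against length(cols) keeps the condition correct whether two or five columns are passed.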

                                                                        
                                                        
            
            
              
                
温柔的废话 (2021-02-04 11:20)

dplyr solution

library(dplyr)
# keep rows where at least one of x and y is not NA
df %>% filter_at(.vars = vars(x, y), .vars_predicate = any_vars(!is.na(.)))


This can be modified to take any number of columns via the .vars argument.
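
In dplyr 1.0.0 and later the scoped verbs such as filter_at() are superseded, and from dplyr 1.0.4 the same filter can be written with if_any(). The following is a sketch of that alternative rather than part of the original answer; the sample data is an assumption chosen for illustration.

```r
library(dplyr)

# Sample data; values are assumed for illustration
df <- data.frame(x = c(1, 2, NA, 3, NA),
                 y = c(4, 5, NA, 6, 7),
                 z = c(8, 9, 10, 11, NA))

# Keep rows where at least one of x and y is not NA,
# which is the same as dropping rows where both are NA.
# Requires dplyr >= 1.0.4 for if_any().
res <- df %>% filter(if_any(c(x, y), ~ !is.na(.x)))
```

More columns can be added inside c(), or selected with tidyselect helpers such as starts_with().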
                                                                        
                                                        
            
            
              
                
慢半拍i (2021-02-04 11:37)

You can use apply to slice up the rows:

sel <- apply( df, 1, function(x) sum(is.na(x))>1 )


sel is TRUE for the rows with more than one NA, so negate it to omit those rows:

df[ !sel, ]


To ignore the z column, just omit it from the apply:

sel <- apply( df[,c("x","y")], 1, function(x) sum(is.na(x))>1 )


If is.na has to be TRUE for all of the selected columns, just change the function a little:

sel <- apply( df[,c("x","y")], 1, function(x) all(is.na(x)) )
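
To see how the three variants of sel differ, here is a minimal sketch on a small data frame. The data is an assumption reconstructed from the output printed in the other answers (row 3's z value is a guess), and the names sel_any2, sel_xy and sel_all are illustrative.

```r
# Sample data reconstructed from the other answers' output (row 3's z is assumed)
df <- data.frame(x = c(1, 2, NA, 3, NA),
                 y = c(4, 5, NA, 6, 7),
                 z = c(8, 9, 10, 11, NA))

# More than one NA anywhere in the row (z included)
sel_any2 <- apply(df, 1, function(r) sum(is.na(r)) > 1)

# More than one NA within x and y only
sel_xy <- apply(df[, c("x", "y")], 1, function(r) sum(is.na(r)) > 1)

# Every one of x and y is NA (same rows as sel_xy when there are two columns)
sel_all <- apply(df[, c("x", "y")], 1, function(r) all(is.na(r)))

# Each sel marks the rows to omit, so negate it when subsetting
kept <- df[!sel_all, ]
```

With z included, row 5 is also flagged because x and z are both NA there; restricting the columns to x and y flags only row 3.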


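The other solutions here are more specific to this particular problem, but apply is worth learning because it solves many other problems. The cost is speed (the usual caveats about small datasets and speed testing apply). Note that the apply expression timed below scans all columns and is not negated, so it returns different rows from the other two; only the timings are comparable: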

> microbenchmark( df[!with(df,is.na(x)& is.na(y)),], df[rowSums(is.na(df[c("x", "y")])) != 2, ], df[ apply( df, 1, function(x) sum(is.na(x))>1 ), ] )
Unit: microseconds
                                              expr     min       lq   median       uq      max neval
              df[!with(df, is.na(x) & is.na(y)), ]  67.148  71.5150  76.0340  86.0155 1049.576   100
        df[rowSums(is.na(df[c("x", "y")])) != 2, ] 132.064 139.8760 145.5605 166.6945  498.934   100
 df[apply(df, 1, function(x) sum(is.na(x)) > 1), ] 175.372 184.4305 201.6360 218.7150  321.583   100

                                                                        
                                                        
            
            
              
                