How to delete columns that contain ONLY NAs?

后端未结

关注

 7  1098

I have a data.frame containing some columns with all NA values, how can I delete them from the data.frame.

Can I use the function

na.omit(...)


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  花落未央        
                
              
                            
                2020-11-28 04:45
              
            
            
                                                                       
Here is a dplyr solution:

df %>% select_if(~sum(!is.na(.)) > 0)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  盖世英雄少女心        
                
              
                            
                2020-11-28 04:47
              
            
            
                                                                       
Because performance was really important for me, I benchmarked all the functions above.

NOTE: Data from @Simon O'Hanlon's post. Only with size 15000 instead of 10.

library(tidyverse)
library(microbenchmark)

set.seed(123)
df <- data.frame(id = 1:15000,
                 nas = rep(NA, 15000), 
                 vals = sample(c(1:3, NA), 15000,
                               repl = TRUE))
df

MadSconeF1 <- function(x) x[, colSums(is.na(x)) != nrow(x)]

MadSconeF2 <- function(x) x[colSums(!is.na(x)) > 0]

BradCannell <- function(x) x %>% select_if(~sum(!is.na(.)) > 0)

SimonOHanlon <- function(x) x[ , !apply(x, 2 ,function(y) all(is.na(y)))]

jsta <- function(x) janitor::remove_empty(x)

SiboJiang <- function(x) x %>% dplyr::select_if(~!all(is.na(.)))

akrun <- function(x) Filter(function(y) !all(is.na(y)), x)

mbm <- microbenchmark(
  "MadSconeF1" = {MadSconeF1(df)},
  "MadSconeF2" = {MadSconeF2(df)},
  "BradCannell" = {BradCannell(df)},
  "SimonOHanlon" = {SimonOHanlon(df)},
  "SiboJiang" = {SiboJiang(df)},
  "jsta" = {jsta(df)}, 
  "akrun" = {akrun(df)},
  times = 1000)

mbm


Results:

Unit: microseconds
         expr    min      lq      mean  median      uq      max neval  cld
   MadSconeF1  154.5  178.35  257.9396  196.05  219.25   5001.0  1000 a   
   MadSconeF2  180.4  209.75  281.2541  226.40  251.05   6322.1  1000 a   
  BradCannell 2579.4 2884.90 3330.3700 3059.45 3379.30  33667.3  1000    d
 SimonOHanlon  511.0  565.00  943.3089  586.45  623.65 210338.4  1000  b  
    SiboJiang 2558.1 2853.05 3377.6702 3010.30 3310.00  89718.0  1000    d
         jsta 1544.8 1652.45 2031.5065 1706.05 1872.65  11594.9  1000   c 
        akrun   93.8  111.60  139.9482  121.90  135.45   3851.2  1000 a


autoplot(mbm)




mbm %>% 
  tbl_df() %>%
  ggplot(aes(sample = time)) + 
  stat_qq() + 
  stat_qq_line() +
  facet_wrap(~expr, scales = "free")



                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别那么骄傲        
                
              
                            
                2020-11-28 04:48
              
            
            
                                                                       
Another option is the janitor package:

df <- remove_empty_cols(df)


https://github.com/sfirke/janitor
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  无人共我        
                
              
                            
                2020-11-28 04:52
              
            
            
                                                                       
It seeems like you want to remove ONLY columns with ALL NAs, leaving columns with some rows that do have NAs. I would do this (but I am sure there is an efficient vectorised soution:

#set seed for reproducibility
set.seed <- 103
df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )
df
#      id nas vals
#   1   1  NA   NA
#   2   2  NA    2
#   3   3  NA    1
#   4   4  NA    2
#   5   5  NA    2
#   6   6  NA    3
#   7   7  NA    2
#   8   8  NA    3
#   9   9  NA    3
#   10 10  NA    2

#Use this command to remove columns that are entirely NA values, it will elave columns where only some vlaues are NA
df[ , ! apply( df , 2 , function(x) all(is.na(x)) ) ]
#      id vals
#   1   1   NA
#   2   2    2
#   3   3    1
#   4   4    2
#   5   5    2
#   6   6    3
#   7   7    2
#   8   8    3
#   9   9    3
#   10 10    2


If you find yourself in the situation where you want to remove columns that have any NA values you can simply change the all command above to any.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  忘了有多久        
                
              
                            
                2020-11-28 04:52
              
            
            
                                                                       
An intuitive script: dplyr::select_if(~!all(is.na(.))).  It literally keeps only not-all-elements-missing columns. (to delete all-element-missing columns). 

> df <- data.frame( id = 1:10 , nas = rep( NA , 10 ) , vals = sample( c( 1:3 , NA ) , 10 , repl = TRUE ) )

> df %>% glimpse()
Observations: 10
Variables: 3
$ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ nas  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
$ vals <int> NA, 1, 1, NA, 1, 1, 1, 2, 3, NA

> df %>% select_if(~!all(is.na(.))) 
   id vals
1   1   NA
2   2    1
3   3    1
4   4   NA
5   5    1
6   6    1
7   7    1
8   8    2
9   9    3
10 10   NA

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  时光说笑        
                
              
                            
                2020-11-28 04:54
              
            
            
                                                                       
One way of doing it:

df[, colSums(is.na(df)) != nrow(df)]


If the count of NAs in a column is equal to the number of rows, it must be entirely NA.

Or similarly

df[colSums(!is.na(df)) > 0]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复