R data.table remove rows where one column is duplicated if another column is NA


Here is an example data.table:

library(data.table)
dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))


                      
3 Answers

余生分开走
2021-01-26 20:53

You are missing a parenthesis (probably a typo); I suppose it should be length(col1) > 1. You also used ifelse on a scalar condition, which will not work as you expect: only the first element of the vector is picked up. If you want to remove the NA values from a group that also has non-NA values, you can use if/else:

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1]

#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA
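
To make the ifelse point concrete, here is a minimal sketch in base R. The vector x and the names from_ifelse and from_if are made up for illustration; they are not from the question.

```r
# Illustrative vector (an assumption, not the asker's data)
x <- c("dog", "cat")

# ifelse() shapes its result like the test, so a length-1 test
# yields a length-1 result: only the first element of x survives
from_ifelse <- ifelse(length(x) > 1, x, NA_character_)

# if/else evaluates the scalar condition once and returns
# the whole chosen branch
from_if <- if (length(x) > 1) x else NA_character_

print(from_ifelse)
print(from_if)
```

This is why if/else is the better fit inside j when the condition describes the whole group rather than each row.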

                                                                        
                                                        
            
            
              
                

一生所求
2021-01-26 20:59


  group by col1, then if group has more than one row and one of them is NA, remove it. 


Use an anti-join:

dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)]

   col1   col2
1:    A    dog
2:    B    cat
3:    C   jeep
4:    C porsch
5:    D     NA
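
It may help to look at the anti-join in two steps. This is only a sketch of the same one-liner split apart; the name to_drop is introduced here for illustration.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# Step 1: for every group with more than one row, build one row
# whose col2 is NA. These are the (col1, col2) pairs to drop.
to_drop <- dt[, if (.N > 1L) .SD[NA_integer_], by = col1]
print(to_drop)

# Step 2: anti-join on all columns, keeping the rows of dt that
# have no match in to_drop
res <- dt[!to_drop, on = names(dt)]
print(res)
```

Note that a multi-row group whose rows are all NA would lose every row with this approach, which is why the benchmark that follows assumes there are no full duplicates.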


Benchmark from @thela, but assuming there are no (full) dupes in the original data:

set.seed(1)
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
dt2 = unique(dt2a)

system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#    user  system elapsed 
#    0.73    0.06    0.81

system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#    user  system elapsed 
#    2.86    0.03    2.89 

system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)])
#    user  system elapsed 
#    0.39    0.01    0.41 

fsetequal(res, res_thela) # TRUE
fsetequal(res, res_psidom) # TRUE


I changed my approach a little for speed. If data.table had a having= argument, this might become faster and more legible.
                                                                        
                                                        
            
            
              
                

忘了有多久
2021-01-26 21:00

An attempt to find all the NA cases in groups where there is also a non-NA value, and then remove those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]
#   col1   col2
#1:    A    dog
#2:    B    cat
#3:    C   jeep
#4:    C porsch
#5:    D     NA


Seems quicker, though I'm sure someone is going to turn up with an even quicker version shortly:

set.seed(1)
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE))
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1])
#   user  system elapsed 
#   1.49    0.02    1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1])
#   user  system elapsed 
#   4.49    0.04    4.54 
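
A variant that may be worth considering selects the rows to keep instead of negating the rows to drop. As far as I know, a zero-length negative index selects no rows rather than all of them, so the negative-index form could return an empty table when there is nothing to remove. The sketch below avoids that; keep_idx, dt_clean and res_clean are names made up for illustration, and I have not benchmarked it.

```r
library(data.table)

dt <- data.table(col1 = c("A", "A", "B", "C", "C", "D"),
                 col2 = c(NA, "dog", "cat", "jeep", "porsch", NA))

# Keep a row if it is non-NA, or if its whole group is NA.
# sort() restores the original row order after grouping.
keep_idx <- sort(dt[, .I[!is.na(col2) | all(is.na(col2))], by = col1]$V1)
res <- dt[keep_idx]
print(res)

# A table with nothing to remove (hypothetical data) stays intact
dt_clean <- data.table(col1 = c("A", "B"), col2 = c("dog", "cat"))
res_clean <- dt_clean[sort(dt_clean[, .I[!is.na(col2) | all(is.na(col2))],
                                    by = col1]$V1)]
print(res_clean)
```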

                                                                        
                                                        
            
            
              
                