How many unique keys does my data.table have?

Given a data.table, how do I find the number of unique keys it contains?

library(data.table)
z <- data.table(id=c(1,2,1,3),key="id")
length(uni


                      
2 Answers

[愿得一人] answered 2021-01-25 05:10

I'll expand my comment as an answer.

base::unique (unique.default) on vectors uses hash tables and is quite efficient: each insert/search has an average complexity of O(1), which is very likely to be the general case, so a vector of length n is processed in O(n) on average. The worst-case complexity of a single insert/search is O(n), but the chances of hitting that at each insert/search should be extremely small; the hash function would have to be terrible for that to happen.

In your question, you have only one key column, so base's unique should be quite efficient. With more than one column, however, unique.data.frame is very inefficient, as it coerces all the columns to character, pastes them together, and then calls unique.default on the result.
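To make the single-column versus multi-column distinction concrete, here is a minimal base R sketch. The vector `ids` and the data frame `df` are made-up illustration data, not taken from the question.

```r
# Illustration data (hypothetical, not from the question)
ids <- c(1, 2, 1, 3)
df  <- data.frame(a = c(1, 2, 1, 3), b = c("x", "y", "z", "x"))

# A single vector dispatches to unique.default
n_unique_ids <- length(unique(ids))

# A data frame dispatches to unique.data.frame, which compares whole rows
n_unique_rows <- nrow(unique(df))
```

The first count looks at one column only; the second treats every distinct combination of `a` and `b` as its own row.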

You can use:

nrow(unique(z))
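
A slightly more explicit variant, as a sketch: it names the key columns in `by` instead of relying on the default, and it also uses `uniqueN()`, which assumes a reasonably recent data.table version. The extra `val` column is an assumption added only to show that non-key columns are ignored.

```r
library(data.table)

# Same ids as in the question, plus a hypothetical non-key column
z <- data.table(id = c(1, 2, 1, 3), val = c("a", "b", "c", "d"), key = "id")

# Name the key columns explicitly rather than relying on the default of `by`
n_keys_unique <- nrow(unique(z, by = key(z)))

# uniqueN() returns the count directly, without building the unique rows
n_keys_uniqueN <- uniqueN(z, by = key(z))
```

Both calls should agree; `uniqueN()` is simply the shorter way to ask for the count.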


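data.table's unique method, by default, provides the key columns to its by argument. (That was the default when this answer was written; since data.table v1.9.8 the default is all columns, so pass by = key(z) to get the same behaviour.) And since we know the data is already sorted by key, instead of ordering we use data.table:::uniqlist to fetch the indices corresponding to unique rows much more efficiently, also in O(n). It is therefore efficient for any number of key columns.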
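
The idea of exploiting sortedness can be sketched in base R. This is not data.table's internal uniqlist, only an illustration of the principle: in a sorted vector, a new unique value starts wherever an element differs from its predecessor, so a single linear pass is enough. The helper name `count_unique_sorted` is hypothetical, and the sketch assumes the input contains no NA values.

```r
# Hypothetical helper: count distinct values in an already sorted vector
# by comparing each element with its predecessor (one linear pass, no hashing).
# Assumes x is sorted and contains no NA.
count_unique_sorted <- function(x) {
  n <- length(x)
  if (n == 0L) return(0L)
  # TRUE wherever a new run of equal values starts; the first element always does
  starts <- c(TRUE, x[-1L] != x[-n])
  sum(starts)
}

x <- sort(c(1, 2, 1, 3, 3, 3))
count_unique_sorted(x)
```

On sorted input this should return the same count as `length(unique(x))`.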

However, we could store this information (the number of unique keys) as an attribute when the key is set, as that is quite straightforward.
                                                                        
                                                        
            
            
              
                

南方客 answered 2021-01-25 05:17

Maybe this:

sum(Negate(duplicated)(z$id))


z$id remains sorted, so duplicated can work faster on it:

bigVec <- sample(1:100000, 30000000, replace=TRUE)
system.time( sum(Negate(duplicated)(bigVec)) )
   user  system elapsed 
  8.161   0.475   8.690 

bigVec <- sort(bigVec)
system.time( sum(Negate(duplicated)(bigVec)) )
   user  system elapsed 
   0.00    2.09    2.10 


But I just checked and length(unique()) works faster on sorted vectors as well...

So maybe some kind of check for whether the vector is sorted is going on (which can be done in linear time). To me this doesn't look quadratic:

system.time( length(unique(bigVec)) )
   user  system elapsed 
  0.000   0.583   0.664 

bigVec <- sort(sample(1:100000, 20000000, replace=TRUE))
system.time( length(unique(bigVec)) )
   user  system elapsed 
  0.000   1.290   1.242 

bigVec <- sort(sample(1:100000, 30000000, replace=TRUE))
system.time( length(unique(bigVec)) )
   user  system elapsed 
  0.000   1.655   1.715 
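
To repeat the comparison above on a smaller scale, here is a sketch of a timing harness. The seed and the vector length of 1e6 are arbitrary assumptions chosen to keep the run short. It first checks that both idioms give the same count, then times each on unsorted and sorted input.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
n <- 1e6     # far smaller than the 3e7 used above, to keep the run short
unsorted <- sample(1:100000, n, replace = TRUE)
sorted   <- sort(unsorted)

# Both idioms must agree on the count, whatever the ordering
count_dup    <- sum(!duplicated(sorted))  # same as sum(Negate(duplicated)(sorted))
count_unique <- length(unique(sorted))
stopifnot(count_dup == count_unique)

# Compare timings; absolute numbers depend on the machine and R version
timings <- rbind(
  duplicated_unsorted = system.time(sum(!duplicated(unsorted))),
  duplicated_sorted   = system.time(sum(!duplicated(sorted))),
  unique_unsorted     = system.time(length(unique(unsorted))),
  unique_sorted       = system.time(length(unique(sorted)))
)
print(timings[, c("user.self", "sys.self", "elapsed")])
```

No expected timings are given, since they vary between machines; the point is only to compare the sorted and unsorted rows side by side.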

                                                                        
                                                        
            
            
              
                