Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan

I recently discovered binary search in data.table. If the table is keyed on multiple columns, is it possible to search on the 2nd key column only?

DT = dat


                      
2 Answers

慢半拍i, 2020-11-27 17:31

Yes. Pass every unique value of the first key column, together with the specific value you want for the second key column:

DT[J(unique(x), 25), nomatch=0]


If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ (cross join), which builds every combination of its arguments:

DT[CJ(unique(x), 25:24), nomatch=0]


Note that CJ by default sorts its arguments and keys the result on all columns, so the joined output comes back sorted as well. If that is not desirable, pass sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]
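
As a quick sanity check, the J and CJ lookups above can be compared with their vector-scan equivalents on a small table. This is only a sketch: the toy data below is hypothetical (it is not the asker's DT), and it assumes a table keyed on x and y.

```r
library(data.table)

# Hypothetical toy table keyed on (x, y); not the asker's original data
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(24, 25, 26), times = 3),
                 v = 1:9)
setkey(DT, x, y)

# One value of the second key column: pair it with every value of the first
one <- DT[J(unique(x), 25), nomatch = 0]

# Several values of the second key column: CJ builds every (x, y) combination
many <- DT[CJ(unique(x), 25:24), nomatch = 0]

# Both return the same rows as the vector-scan forms
stopifnot(identical(one$v, DT[y == 25]$v))
stopifnot(setequal(many$v, DT[y %in% 25:24]$v))
```

nomatch=0 matters here because the cross join can generate (x, y) pairs that do not exist in DT; without it those pairs would come back as NA rows.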


There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function, set2key.

FR#1007 Build in secondary keys
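
For readers on a newer data.table than this answer was written for: later releases expose secondary indices through setindex() and the on= argument rather than set2key. The sketch below assumes such a release is installed and uses a hypothetical toy table; treat it as an illustration, not as what the feature request specified.

```r
library(data.table)

# Hypothetical toy table keyed on (x, y)
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(24, 25, 26), times = 3),
                 v = 1:9)
setkey(DT, x, y)

# Add a secondary index on y; the primary key (x, y) and row order are unchanged
setindex(DT, y)

# Join on y alone; on= names the column, so the key order no longer matters
res <- DT[.(25), on = "y"]

stopifnot(identical(key(DT), c("x", "y")))
stopifnot(identical(indices(DT), "y"))
stopifnot(identical(res$v, c(2L, 5L, 8L)))
```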

There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so should be faster than base merge. See ?merge.data.table.
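
A minimal sketch of the merge route, assuming the values of interest are placed in a small lookup table (the name wanted and the toy data are hypothetical):

```r
library(data.table)

# Hypothetical toy table keyed on (x, y)
DT <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = rep(c(24, 25, 26), times = 3),
                 v = 1:9)
setkey(DT, x, y)

# Hypothetical lookup table holding the wanted values of the second key column
wanted <- data.table(y = 25)

# merge dispatches to the data.table method; by= names the join column
res <- merge(DT, wanted, by = "y")

stopifnot(nrow(res) == 3L)
stopifnot(setequal(res$v, c(2L, 5L, 8L)))
```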
                                                                        
                                                        
            
            
              
                
生来不讨喜, 2020-11-27 17:35

Based on this email thread I wrote the following functions:

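# Build a lookup table keyed on the given columns; column i stores the
# position of each row in dt.
create_index = function(dt, ..., verbose = getOption("datatable.verbose")) {
  # Capture the unquoted column names passed in ... as a character vector
  # (avoids relying on the internal data.table:::getdots())
  cols = as.character(substitute(list(...)))[-1L]
  res = dt[, cols, with=FALSE]
  res[, i := seq_len(nrow(dt))]
  setkeyv(res, cols, verbose = verbose)
}

# Binary-search the index and return the matching row numbers of dt.
JI = function(index, ...) {
  index[J(...)][["i"]]
}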


Here are the results on my system with a larger DT (1e8 rows):

> system.time(DT[J("c")])
   user  system elapsed 
  0.168   0.136   0.306 

> system.time(DT[J(unique(x), 25)])
   user  system elapsed 
  2.472   1.508   3.980 
> system.time(DT[y==25])
   user  system elapsed 
  4.532   2.149   6.674 

> system.time(IDX_y <- create_index(DT, y))
   user  system elapsed 
  3.076   2.428   5.503 
> system.time(DT[JI(IDX_y, 25)])
   user  system elapsed 
  0.512   0.320   0.831     


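Building the index costs about 5.5 s here, so it is worth it only if you use the index multiple times: each indexed lookup then takes about 0.8 s, compared with about 4 s for DT[J(unique(x), 25)] and about 6.7 s for the vector scan DT[y==25].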
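
To convince yourself that the index lookup returns the same rows as the vector scan, a small-scale check along these lines may help. It is a sketch under assumptions: make_index and lookup_rows are hypothetical variants of the helpers above that take column names as strings, and the data is randomly generated rather than the 1e8-row table that was timed.

```r
library(data.table)

# Hypothetical variant of create_index that takes column names as strings
make_index <- function(dt, cols) {
  idx <- dt[, cols, with = FALSE]
  idx[, row_id := seq_len(nrow(dt))]  # remember each row's position in dt
  setkeyv(idx, cols)
  idx[]
}

# Hypothetical variant of JI: binary search on the index, return row numbers in dt
lookup_rows <- function(index, ...) {
  index[J(...), nomatch = 0L][["row_id"]]
}

set.seed(1)
DT <- data.table(x = sample(letters[1:3], 1e4, replace = TRUE),
                 y = sample(1:50, 1e4, replace = TRUE),
                 v = seq_len(1e4))
setkey(DT, x, y)

IDX_y <- make_index(DT, "y")          # build once
via_index <- DT[lookup_rows(IDX_y, 25L)]  # reuse for each lookup
via_scan  <- DT[y == 25L]

stopifnot(identical(via_index$v, via_scan$v))
```

The index stores row positions, so it has to be rebuilt whenever DT is reordered or rows are added or removed.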
                                                                        
                                                        
            
            
              
                