Fastest method to replace data values conditionally in data.table (speed comparison)

Asked by 醉酒成梦, 2021-01-04 14:26

Why does the second method become slower as the data.table size increases?

library(data.table)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=


        
1 Answer
  • 2021-01-04 15:00

    On your last case, it is a consequence of the auto-indexing feature in data.table, available since v1.9.4+. Read on for the full picture :-).

    When you do DT[col == .] or DT[col %in% .], an index is generated automatically on the first run. The index is simply the order of the column you specify, and computing it is quite fast (counting sort / true radix sort).
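    You can watch this happen directly; a minimal sketch on a small table (the helper `indices()` from data.table lists the indices currently stored on a table, and auto-indexing is on by default):

        library(data.table)

        # small table; the mechanism is the same as for the 120-million-row one
        dt <- data.table(y = c(3, 1, 6, 1, 3, 6))

        indices(dt)        # NULL -- no index yet
        dt[y == 6]         # first '==' subset triggers auto-indexing
        indices(dt)        # "y" -- an index on column y now exists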

    The table is 120 million rows and it takes roughly:

    # clean session
    require(data.table)
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(data.table:::forderv(DF, "y"))
    #    user  system elapsed 
    #   3.923   0.736   4.712 
    

    Side note: column y doesn't really need to be double (ordering on doubles takes longer). If we convert it to integer type:

       DF[, y := as.integer(y)]
       system.time(data.table:::forderv(DF, "y"))
       #    user  system elapsed 
       #   0.569   0.140   0.717 
    

    The advantage is that any subsequent subsets on that column using == or %in% will be blazing fast (Slides, R script, video of Matt's presentation). For example:

    # clean session, copy/paste code from above to create DF
    system.time(DF[y==6, y := 10])
    #    user  system elapsed 
    #   4.750   1.121   5.932 
    
    system.time(DF[y==6, y := 10])
    #    user  system elapsed 
    #   4.002   0.907   4.969 
    

    Oh, wait a minute.. it isn't fast. But.. indexing..?!? We're replacing the same column with a new value every time. That changes the order of the column, and thereby removes the index. Let's subset on y, but modify v:

    # clean session
    require(data.table)
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   4.653   1.071   5.765 
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   0.685   0.213   0.910 
    
    options(datatable.verbose=TRUE)
    system.time(DF[y==6, v := 10L])
    # Using existing index 'y'
    # Starting bmerge ...done in 0 secs
    # Detected that j uses these columns: v 
    # Assigning to 40000059 row subset of 120000000 rows
    #    user  system elapsed 
    #   0.683   0.221   0.914 
    

    You can see that the time to compute the matching rows (a binary search on the index, via bmerge) is 0 seconds. Also check ?set2key().
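    The asymmetry above — the index surviving an assignment to v but being dropped by an assignment to y itself — can be checked on a small table (a sketch using the `indices()` helper):

        library(data.table)
        dt <- data.table(y = c(3, 1, 6, 1, 3, 6), v = 1:6)

        dt[y == 6]                 # builds the index on y
        indices(dt)                # "y"

        dt[y == 6, v := 10L]       # modifying v keeps the index on y
        indices(dt)                # still "y"

        dt[y == 6, y := 10]        # modifying y invalidates it
        indices(dt)                # NULL again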

    If you're not going to do repeated subsetting, or, as in your case, you subset and modify the same column, then it makes sense to disable the feature with options(datatable.auto.index = FALSE); filed #1264:

    # clean session
    require(data.table)
    options(datatable.auto.index = FALSE) # disable auto indexing
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   1.067   0.274   1.367 
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   1.100   0.314   1.443 
    

    The difference isn't much here; the time for a plain vector scan, system.time(DF$y == 6), is about 0.448s.

    To sum up: in your case, a vector scan makes more sense. But in general, the idea is that it's better to pay the ordering penalty once and get fast results on all future subsets of that column, rather than vector scanning each and every time.
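    That tradeoff can be sketched in a few lines (a smaller table than the 120-million-row one so it runs quickly; the exact timings will vary by machine and are not asserted here):

        library(data.table)
        set.seed(1L)
        n  <- 1e6
        dt <- data.table(y = sample(c(1, 3, 6), n, TRUE), v = seq_len(n))

        # pay the indexing cost once on the first '==' subset...
        t_first  <- system.time(dt[y == 6, v := 10L])["elapsed"]
        # ...then reuse the index on every later subset of the same column
        t_second <- system.time(dt[y == 6, v := 10L])["elapsed"]

        # plain vector scan on the column, for scale
        t_scan <- system.time(dt$y == 6)["elapsed"]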

    The auto-indexing feature is relatively new, will be extended over time, and will probably be optimised (there may be places we've not looked at yet). While answering this question, I realised that we don't show the time taken to compute the sort order (using fsort(); I guess the time spent there might be the reason the timings are quite close); filed #1265.


    As to why your second case is slow, I'm not quite sure; I suspect it might be due to unnecessary copies on R's part. What version of R are you using? In the future, please always post your sessionInfo() output.
