Fastest method to replace data values conditionally in data.table (speed comparison)

Asked by 醉酒成梦, 2021-01-04 14:26

Why does the second method become slower as the data.table size increases?

library(data.table)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=


        
1 Answer
  • 2021-01-04 15:00

    On your last case, it is a consequence of the auto-indexing feature in data.table, available since v1.9.4+. Read on for the full picture :-).

    When you do DT[col == .] or DT[col %in% .], an index is generated automatically on the first run. The index is simply the order of the column you specify, and computing it is quite fast (counting sort / true radix sort).
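    You can watch this happen directly; a minimal sketch on a small table (the helper `indices()` from data.table lists the indices currently stored on a table, and auto-indexing is on by default):

        library(data.table)

        # small table; the mechanism is the same as for the 120-million-row one
        dt <- data.table(y = c(3, 1, 6, 1, 3, 6))

        indices(dt)        # NULL -- no index yet
        dt[y == 6]         # first '==' subset triggers auto-indexing
        indices(dt)        # "y" -- an index on column y now exists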

    The table is 120 million rows and it takes roughly:

    # clean session
    require(data.table)
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(data.table:::forderv(DF, "y"))
    #    user  system elapsed 
    #   3.923   0.736   4.712 
    

    Side note: column y doesn't really need to be double (ordering on doubles takes longer). If we convert it to integer type:

       DF[, y := as.integer(y)]
       system.time(data.table:::forderv(DF, "y"))
       #    user  system elapsed 
       #   0.569   0.140   0.717 
    

    The advantage is that any subsequent subsets on that column using == or %in% will be blazing fast (Slides, R script, video of Matt's presentation). For example:

    # clean session, copy/paste code from above to create DF
    system.time(DF[y==6, y := 10])
    #    user  system elapsed 
    #   4.750   1.121   5.932 
    
    system.time(DF[y==6, y := 10])
    #    user  system elapsed 
    #   4.002   0.907   4.969 
    

    Oh, wait a minute.. it isn't fast. But.. indexing..?!? We're replacing the same column with a new value every time. That changes the order of the column, and thereby removes the index. Let's subset on y, but modify v:

    # clean session
    require(data.table)
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   4.653   1.071   5.765 
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   0.685   0.213   0.910 
    
    options(datatable.verbose=TRUE)
    system.time(DF[y==6, v := 10L])
    # Using existing index 'y'
    # Starting bmerge ...done in 0 secs
    # Detected that j uses these columns: v 
    # Assigning to 40000059 row subset of 120000000 rows
    #    user  system elapsed 
    #   0.683   0.221   0.914 
    

    You can see that the time to compute the matching rows (a binary search on the index, via bmerge) is 0 seconds. Also check ?set2key().
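    The asymmetry above — the index surviving an assignment to v but being dropped by an assignment to y itself — can be checked on a small table (a sketch using the `indices()` helper):

        library(data.table)
        dt <- data.table(y = c(3, 1, 6, 1, 3, 6), v = 1:6)

        dt[y == 6]                 # builds the index on y
        indices(dt)                # "y"

        dt[y == 6, v := 10L]       # modifying v keeps the index on y
        indices(dt)                # still "y"

        dt[y == 6, y := 10]        # modifying y invalidates it
        indices(dt)                # NULL again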

    If you're not going to do repeated subsetting, or, as in your case, you subset and modify the same column, then it makes sense to disable the feature with options(datatable.auto.index = FALSE); filed #1264:

    # clean session
    require(data.table)
    options(datatable.auto.index = FALSE) # disable auto indexing
    set.seed(1L)
    DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
    
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   1.067   0.274   1.367 
    system.time(DF[y==6, v := 10L])
    #    user  system elapsed 
    #   1.100   0.314   1.443 
    

    The difference isn't much here; the time for a plain vector scan, system.time(DF$y == 6), is about 0.448s.

    To sum up: in your case, a vector scan makes more sense. But in general, the idea is that it's better to pay the ordering penalty once and get fast results on all future subsets of that column, rather than vector scanning each and every time.
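    That tradeoff can be sketched in a few lines (a smaller table than the 120-million-row one so it runs quickly; the exact timings will vary by machine and are not asserted here):

        library(data.table)
        set.seed(1L)
        n  <- 1e6
        dt <- data.table(y = sample(c(1, 3, 6), n, TRUE), v = seq_len(n))

        # pay the indexing cost once on the first '==' subset...
        t_first  <- system.time(dt[y == 6, v := 10L])["elapsed"]
        # ...then reuse the index on every later subset of the same column
        t_second <- system.time(dt[y == 6, v := 10L])["elapsed"]

        # plain vector scan on the column, for scale
        t_scan <- system.time(dt$y == 6)["elapsed"]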

    The auto-indexing feature is relatively new, will be extended over time, and will probably be optimised (there may be places we've not looked at yet). While answering this question, I realised that we don't show the time taken to compute the sort order (using fsort(); I guess the time spent there might be the reason the timings are quite close); filed #1265.


    As to why your second case is slow, I'm not quite sure; I suspect it might be due to unnecessary copies on R's part. What version of R are you using? In the future, please always post your sessionInfo() output.
