Why the second method will become slower by increasing the data.table size:
library(data.table)
DF = data.table(x=rep(c(\"a\",\"b\",\"c\"),each=40000000), y=
On your last case, it is a consequence of auto-indexing feature in data.table
, since v1.9.4+. Read more for the full picture :-).
When you do DT[col == .]
or DT[col %in% .]
, an index is being generated automatically on your first run. The index is just the order
of the column you specify. The computation of indices is quite fast (using counting sort / true radix sorting).
The table is 120 million rows and it takes roughly:
# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
system.time(data.table:::forderv(DF, "y"))
# 3.923 0.736 4.712
Side note: Column
y
need not be really double (on which ordering takes longer). If we convert it to integer type:DF[, y := as.integer(y)] system.time(data.table:::forderv(DF, "y")) # user system elapsed # 0.569 0.140 0.717
The advantage is that any subsequent subsets on that column using ==
or %in%
will be blazing fast (Slides, R script, video of Matt's presentation). For example:
# clean session, copy/paste code from above to create DF
system.time(DF[y==6, y := 10])
# user system elapsed
# 4.750 1.121 5.932
system.time(DF[y==6, y := 10])
# user system elapsed
# 4.002 0.907 4.969
Oh wait a minute.. it isn't fast. But.. indexing..?!? We're replacing the same column every time with a new value. This results in the order of that column getting changed (thereby removing the index). Let's subsetting y
, but modifying v
:
# clean session
require(data.table)
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
system.time(DF[y==6, v := 10L])
# user system elapsed
# 4.653 1.071 5.765
system.time(DF[y==6, v := 10L])
# user system elapsed
# 0.685 0.213 0.910
options(datatable.verbose=TRUE)
system.time(DF[y==6, v := 10L])
# Using existing index 'y'
# Starting bmerge ...done in 0 secs
# Detected that j uses these columns: v
# Assigning to 40000059 row subset of 120000000 rows
# user system elapsed
# 0.683 0.221 0.914
You can see that the time to compute the indices (using binary search) takes 0 seconds. Also check ?set2key()
.
If you're not going to do repeated subsetting, or as in your case, subsetting and modifying the same column, then it makes sense to disable the feature by doing options(datatable.auto.index = FALSE)
, filed #1264:
# clean session
require(data.table)
options(datatable.auto.index = FALSE) # disable auto indexing
set.seed(1L)
DF = data.table(x=rep(c("a","b","c"),each=40000000), y=sample(c(1,3,6),40000000,T), v=1:9)
system.time(DF[y==6, v := 10L])
# user system elapsed
# 1.067 0.274 1.367
system.time(DF[y==6, v := 10L])
# user system elapsed
# 1.100 0.314 1.443
The difference isn't much here. The time to vector scan is system.time(DF$y == 6)
= 0.448s
.
To sum up, in your case, vector scan makes more sense. But in general, the idea is that it's better to pay the penalty once and have fast results on future subsets on that column, rather than vector scanning each and every time.
Auto indexing feature is relatively new, and will be extended over time, and probably optimised (perhaps there are places we've not looked at). While answering this Q, I realised that we don't show the time to compute the sort order (using
fsort()
, and I guess the time spent there might be the reason the timings are quite close, filed #1265).
As to your second case being slow, not quite sure why. I suspect it might be due to unnecessary copies from R's part. What version of R are you using? For the future, always post your sessionInfo()
output.