I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to und
<- with data.table is just like base; i.e., no copy is taken until a subassign is done afterwards with <- (such as changing the column names or changing an element such as DT[i,j] <- v). Then it takes a copy of the whole object just like base. That's known as copy-on-write. Would be better known as copy-on-subassign, I think! It DOES NOT copy when you use the special := operator, or the set* functions provided by data.table. If you have large data you probably want to use them instead. := and set* will NOT COPY the data.table, EVEN WITHIN FUNCTIONS.
Given this example data :
library(data.table)
DT <- data.table(a=c(1,2), b=c(11,12))
The following just "binds" another name DT2 to the same data object currently bound to the name DT :
DT2 <- DT
This never copies, and never copies in base either. It just marks the data object so that R knows that two different names (DT2 and DT) point to the same object. And so R will need to copy the object if either is subassigned to afterwards.
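As a rough check of the point above, here is a minimal sketch using data.table's address() helper. It rebuilds the small example table so it runs on its own; the variable names same_before and same_after are just for illustration.

```r
library(data.table)

DT  <- data.table(a = c(1, 2), b = c(11, 12))
DT2 <- DT                                    # binds a second name, no copy yet

same_before <- address(DT) == address(DT2)   # both names refer to one object

DT2[1, "a"] <- 99                            # base-style subassign: a copy is taken here
same_after <- address(DT) == address(DT2)    # DT2 is now a separate object

print(c(same_before, same_after))
print(DT$a)                                  # DT is not affected by the change to DT2
```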
That's perfect for data.table, too. The := isn't for doing that. So the following is a deliberate error as := isn't for just binding object names :
DT2 := DT # not what := is for, not defined, gives a nice error
:= is for subassigning by reference. But you don't use it like you would in base :
DT[3,"foo"] := newvalue # not like this
you use it like this :
DT[3,foo:=newvalue] # like this
That changed DT by reference. If you add a new column new by reference to the data object, there is no need to do this :
DT <- DT[,new:=1L]
because the RHS already changed DT by reference. The extra DT <- misunderstands what := does. You can write it there, but it's superfluous.
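To make that concrete, here is a small self-contained sketch (again rebuilding the example table, and using row 2 because the table only has two rows). It assumes the default column over-allocation, so adding one column does not need a reallocation.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr_before <- address(DT)

DT[2, b := 99]        # update one element by reference
DT[, new := 1L]       # add a column by reference, no DT <- needed

addr_after <- address(DT)

print(addr_before == addr_after)   # same object throughout
print(names(DT))
```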
DT is changed by reference, by :=, EVEN WITHIN FUNCTIONS :
f <- function(X){
    X[,new2:=2L]
    return("something else")
}
f(DT) # will change DT
DT2 <- DT
f(DT) # will change both DT and DT2 (they're the same data object)
data.table is for large datasets, remember. If you have a 20GB data.table in memory then you need a way to do this. It's a very deliberate design decision of data.table.
Copies can be made, of course. You just need to tell data.table that you're sure you want to copy your 20GB dataset, by using the copy() function :
DT3 <- copy(DT) # rather than DT3 <- DT
DT3[,new3:=3L] # now, this just changes DT3 because it's a copy, not DT too.
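The same idea applies inside a function when the caller's table must stay untouched. A minimal sketch, where add_flag and the flag column are hypothetical names made up for this example:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

# Hypothetical helper: works on an explicit copy so the caller's table is not changed
add_flag <- function(X) {
    X <- copy(X)          # deliberate deep copy
    X[, flag := TRUE]     # := now only touches the local copy
    X
}

res <- add_flag(DT)

print(names(DT))    # the original columns only
print(names(res))   # the copy has the extra column
```

This pays the cost of a full copy, so on large data it is only worth doing when the isolation is really needed.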
To avoid copies, don't use base-style assignment or update :
DT$new4 <- 1L # will make a copy so use :=
attr(DT,"sorted") <- "a" # will make a copy, so use setattr()
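For comparison, a short sketch of the by-reference alternatives using set(), setnames() and setattr() from data.table. The attribute name "note" and the new column names are made up for the example.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr <- address(DT)

set(DT, j = "new4", value = 1L)     # add a column by reference instead of DT$new4 <- 1L
setnames(DT, "b", "b_renamed")      # rename by reference instead of names(DT) <- ...
setattr(DT, "note", "example")      # set an attribute by reference instead of attr(DT, ...) <- ...

print(address(DT) == addr)          # still the same object
print(names(DT))
```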
If you want to be sure that you are updating by reference use .Internal(inspect(x)) and look at the memory address values of the constituents (see Matthew Dowle's answer).
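If the .Internal(inspect(x)) output is too verbose, one possible shortcut is data.table's address() applied to individual columns. This is only a sketch of that idea, not a replacement for the full inspection; the column name new5 is made up for the example.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

col_before <- address(DT$a)      # address of the column vector itself
DT[, new5 := 5L]                 # by-reference change elsewhere in the table
col_after <- address(DT$a)       # column a has not been copied

DT3 <- copy(DT)                  # explicit deep copy
col_in_copy <- address(DT3$a)    # the copy holds its own column vector

print(col_before == col_after)
print(col_before == col_in_copy)
```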
Writing := in j like that allows you to subassign by reference by group. You can add a new column by reference by group. So that's why := is done that way inside [...] :
DT[, newcol:=mean(x), by=group]
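A tiny runnable version of the line above, with a made-up table that has the group and x columns it needs:

```r
library(data.table)

# Illustrative data: two groups, "a" and "b"
DT <- data.table(group = c("a", "a", "b"), x = c(1, 3, 10))

# Each row receives the mean of x within its own group, added by reference
DT[, newcol := mean(x), by = group]

print(DT)
```

Rows in group "a" get the mean of 1 and 3, and the single row in group "b" gets its own value back.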