Short answer: use copy
colsdt <- copy(colnames(dt))
Then you are all good.
dt[,double_quantity:=quantity*2]
str(colsdt)
# chr [1:2] "fruit" "quantity"
What's going on is that in general (i.e., in base R), modifying an object and assigning the result with <- creates a new copy of the object. This is true even when assigning to the same object name, as in x <- x + 1, or, a lot more costly, DF$newCol <- DF$a + DF$b. With large objects (think 100K+ rows and dozens or hundreds of columns; worse with more columns), this can get very costly.
data.table, through pure wizardry (read: C code), avoids this overhead. Instead, it sets a pointer to the same memory location where the object's value is already stored. This is what offers the huge efficiency and speed boost.
But it also means that you often have objects that might otherwise appear to be completely different and independent, but are in fact one and the same.
This is where copy comes in. It creates a new copy of an object, as opposed to passing by reference.
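To make the aliasing concrete, here is a minimal sketch. The table contents and the names dt_alias and dt_copy are made up for illustration; only copy() and := come from data.table.

```r
library(data.table)

# Hypothetical example data, not taken from the original question
dt <- data.table(fruit = c("apple", "banana"), quantity = c(3, 5))

dt_alias <- dt        # plain assignment: a second name for the same table
dt_copy  <- copy(dt)  # explicit copy: an independent table

# Add a column by reference
dt[, double_quantity := quantity * 2]

names(dt_alias)  # includes "double_quantity", because it is the same object as dt
names(dt_copy)   # still only "fruit" and "quantity"
```

The plain assignment costs nothing up front, but it also means dt_alias silently follows every by-reference change made to dt.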
Some more detail as to why this is happening.
Note: I am using the terms "source" and "destination" very loosely, where they refer to the assignment relationship destination <- source
This is in fact expected behavior, admittedly a bit obfuscated.
In base R, when you assign via <-, the two objects point to the same memory location until one of them changes.
This way of handling memory has many benefits, namely that so long as the two objects have the same exact value, there is no need to duplicate memory. This step is held off as long as possible.
a <- 1:5
b <- a
.Internal(inspect(a)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
.Internal(inspect(b)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
# ^^^^ Notice the same memory location
Once either of the two objects changes, that "bond" is broken. That is, changing either the "source" or the "destination" object will cause that object to be reassigned to a new memory location.
a[[3]] <- a[[3]] + 1
.Internal(inspect(a)) # @11004bc38 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,4,4,5
# ^^^^ New location
.Internal(inspect(b)) # @11a5e2a88 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
# ^^^^ Still the same as it was before;
#      note the actual value. This is where `a` _had_ been
The problem in data.table's case is that we rarely reassign the actual data.table object.
Notice that if we modify the "destination" object, then it gets moved (copied) off of that memory location.
colsdt <- colnames(dt)
.Internal(inspect(colnames(dt))) # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
.Internal(inspect(colsdt)) # @114859280 16 STRSXP g0c7 [MARK,NAM(2)] (len=2, tl=100)
# ^^^^ Notice the same memory location
# insignificant change
colsdt[] <- colsdt
.Internal(inspect(colsdt)) # @100aa4a40 16 STRSXP g0c2 [NAM(1)] (len=2, tl=100)
# we can test the original issue from the OP:
dt[, newCol := quantity*2]
str(colnames(dt)) # chr [1:3] "fruit" "quantity" "newCol"
str(colsdt) # chr [1:2] "fruit" "quantity"
The situation to avoid:
However, since we are (almost) always modifying by reference when working with data.table, this can cause unexpected results. Namely, the situation where:
- we assign from a data.table object using the standard <- assignment operator
- then subsequently we change the value of the "source" data.table
- we expect (and our code might depend on) the "destination" object to still have the value previously assigned to it.
This of course will cause an issue.
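The scenario above can be sketched as follows. The data and the variable names cols_plain and cols_safe are hypothetical. Whether the plain assignment actually picks up the new column can depend on the R and data.table versions in use, so the sketch only relies on the behavior of the copy.

```r
library(data.table)

# Hypothetical data mirroring the column names used above
dt <- data.table(fruit = c("apple", "banana"), quantity = c(3, 5))

cols_plain <- colnames(dt)        # may still share memory with dt's names
cols_safe  <- copy(colnames(dt))  # independent snapshot of the names

# Change the "source" by reference
dt[, double_quantity := quantity * 2]

# cols_safe keeps exactly the two original names.
# cols_plain may or may not now contain "double_quantity".
str(cols_safe)
```

If later code loops over the saved names, only cols_safe gives a result that does not depend on what happened to dt in the meantime.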
data.table is an amazingly powerful package. The source of its strength is the fact that it avoids making copies whenever possible.
Best Practice:
This shifts the onus to the user to be deliberate and judicious about copying, and about when to expect a copy to have been made.
In other words, the best practice is:
When you expect a copy to exist, use the copy function.
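One place this rule tends to matter is inside functions that modify a table they were handed. The helper add_double below is a hypothetical illustration, not something from the original question; it assumes the caller wants its own table left untouched.

```r
library(data.table)

# Hypothetical helper: returns a modified table without touching the caller's
add_double <- function(d) {
  d <- copy(d)                          # work on an independent copy
  d[, double_quantity := quantity * 2]  # by-reference change affects only the copy
  d[]                                   # return the modified copy
}

dt  <- data.table(fruit = c("apple", "banana"), quantity = c(3, 5))
res <- add_double(dt)

names(dt)   # "fruit" "quantity"
names(res)  # "fruit" "quantity" "double_quantity"
```

Without the copy() call, the := inside the function would add the column to the caller's dt as well.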