The function set or the expression := inside [.data.table allows the user to update data.tables by reference. How does this behavior diffe
In data.table, := and all set* functions update objects by reference. This was introduced sometime around 2012, IIRC. At that time, base R did not shallow copy, but deep copied. Shallow copying was introduced in R 3.1.0.
It's a wordy/lengthy answer, but I think this answers your first two questions:
How is this base R method different from the data.table equivalent? Is the difference related only to speed or also memory use?
In base R v3.1.0+ when we do:
DF1 = data.frame(x=1:5, y=6:10, z=11:15)
DF2 = DF1[, c("x", "y")]
DF3 = transform(DF2, y = ifelse(y>=8L, 1L, y))
DF4 = transform(DF2, y = 2L)
(1) From DF1 to DF2, both columns are only shallow copied.
(2) From DF2 to DF3, the column y alone had to be copied/re-allocated, but x gets shallow copied again.
(3) From DF2 to DF4, same as (2).
That is, columns are shallow copied as long as the column remains unchanged. In a way, the copy is delayed until it is absolutely necessary.
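One way to check the sharing described above is to compare column addresses. This is only a sketch: it borrows address() from data.table purely as an inspection tool (the objects are plain data.frames), and it assumes R 3.1.0 or later.

```r
library(data.table)  # used here only for address(); the data are plain data.frames

DF1 <- data.frame(x = 1:5, y = 6:10, z = 11:15)
DF2 <- DF1[, c("x", "y")]
DF3 <- transform(DF2, y = ifelse(y >= 8L, 1L, y))

# An identical address means the column vector is shared (shallow copy),
# not duplicated.
same_x_12 <- address(DF1$x) == address(DF2$x)  # x carried from DF1 to DF2
same_x_23 <- address(DF2$x) == address(DF3$x)  # x carried from DF2 to DF3
same_y_23 <- address(DF2$y) == address(DF3$y)  # y was rebuilt by ifelse()

print(c(same_x_12 = same_x_12, same_x_23 = same_x_23, same_y_23 = same_y_23))
```

If the description above holds, the two x comparisons should come out TRUE and the y comparison FALSE.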
In data.table, we modify in-place. This means that even for the DF3 and DF4 steps, column y doesn't get copied.
DT2[y >= 8L, y := 1L] ## (a)
DT2[, y := 2L]
Here, since y is already an integer column and we are modifying it with integer values, by reference, no new memory is allocated at all.
This is also particularly useful when you'd like to sub-assign by reference (marked as (a) above). This is a handy feature we really like in data.table.
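A small self-contained version of the two updates above, with the column address recorded before and after. The DT2 built here is an assumed stand-in for the table used in the snippet, and address() is data.table's helper for inspecting where an object lives in memory.

```r
library(data.table)

# Assumed stand-in for DT2, with the same x and y columns as DF2.
DT2 <- data.table(x = 1:5, y = 6:10)

addr_before <- address(DT2$y)

DT2[y >= 8L, y := 1L]            # sub-assign by reference, as in (a)
y_after_sub <- DT2$y + 0L        # snapshot of the values (a fresh vector)
addr_after_sub <- address(DT2$y)

DT2[, y := 2L]                   # overwrite the whole column with an integer
addr_after_full <- address(DT2$y)

print(c(addr_before, addr_after_sub, addr_after_full))
```

If y is updated in place as described, all three addresses should be the same.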
Another advantage that comes for free (that I came to know from our interactions) is when we have to, say, convert all columns of a data.table from character type to numeric type:
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
Here, since we're updating by reference, each character column gets replaced by reference with its numeric counterpart. After that replacement, the earlier character column isn't required anymore and is up for grabs for garbage collection. But if you were to do this using base R:
DF[] = lapply(DF, as.numeric)
All the columns will have to be converted to numeric, the result will have to be held in a temporary variable, and only then will it be assigned back to DF. That means, if you have 10 columns with 100 million rows, each of character type, then your DF takes up:
10 * 100e6 * 4 / 1024^3 = ~ 3.7GB
And since the numeric type is twice as large, we'll need a total of 7.4GB + 3.7GB of space to make the conversion using base R.
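The arithmetic above can be reproduced directly. The sketch below uses the same per-element sizes as the calculation in this answer (4 bytes per character element, 8 per numeric), which are assumptions about the platform, and then runs the by-reference conversion on a tiny made-up table.

```r
library(data.table)

n_cols <- 10
n_rows <- 100e6
bytes_chr <- 4   # per-element size used in the answer's calculation (assumption)
bytes_num <- 8   # a numeric (double) element

to_gb <- function(bytes) bytes / 1024^3

size_chr  <- to_gb(n_cols * n_rows * bytes_chr)  # the character columns
size_num  <- to_gb(n_cols * n_rows * bytes_num)  # their numeric replacements
peak_base <- size_chr + size_num                 # both held at once in base R

# The same conversion, by reference, on a tiny illustrative table.
DT <- data.table(a = c("1", "2"), b = c("3.5", "4"))
cols <- c("a", "b")
DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]

print(round(c(size_chr = size_chr, size_num = size_num, peak_base = peak_base), 2))
```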
But note that data.table copies during the DF1 to DF2 step. That is:
DT2 = DT1[, c("x", "y")]
results in a copy, because we can't sub-assign by reference on a shallow copy: doing so would update all the clones.
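A quick way to see the effect of that copy is to update the subset by reference and confirm that the original table is untouched. The DT1 below is an assumed data.table counterpart of DF1.

```r
library(data.table)

# Assumed data.table counterpart of DF1.
DT1 <- data.table(x = 1:5, y = 6:10, z = 11:15)
DT2 <- DT1[, c("x", "y")]

# Update DT2 by reference; if DT2 shared columns with DT1,
# this would also change DT1.
DT2[, y := 0L]

print(DT1)
print(DT2)
```

Because the column subset is a real copy, DT1 should still hold 6 to 10 in y after the update to DT2.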
What would be great is if we could seamlessly integrate the shallow copy feature, but keep track of whether a particular object's columns have multiple references, and update by reference wherever possible. R's upgraded reference counting feature might be very useful in this regard. In any case, we're working towards it.
For your last question:
"When is the difference most sizeable?"
There are still people who have to use older versions of R, where deep copies can't be avoided.
It depends on how many columns are being copied because of the operations you perform on them. The worst case, of course, is that you've copied all the columns.
There are cases like this where shallow copying doesn't help.
When you'd like to update columns of a data.frame for each group, and there are too many groups.
When you'd like to update a column of, say, data.table DT1 based on a join with another data.table DT2. This can be done as:
DT1[DT2, col := i.val]
where the i. prefix refers to the value from the val column of DT2 (the i argument) for matching rows. This syntax performs the operation very efficiently, instead of having to first join the entire result and then update the required column.
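A minimal runnable version of this update-on-join might look like the following. The tables, the id column and its values are made up for illustration, and the join column is given explicitly with on= (the one-liner above presumably relies on keys being set instead).

```r
library(data.table)

# Hypothetical tables: DT1 is updated, DT2 supplies the new values.
DT1 <- data.table(id = 1:4, col = c(10L, 20L, 30L, 40L))
DT2 <- data.table(id = c(2L, 4L), val = c(200L, 400L))

# For rows of DT1 that match DT2 on id, overwrite col with DT2's val.
# No joined intermediate table is created; DT1 is modified in place.
DT1[DT2, col := i.val, on = "id"]

print(DT1)
```

Only the rows with id 2 and 4 are changed; the other rows keep their original col values.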
All in all, there are strong cases where update by reference saves a lot of time and memory. But people sometimes prefer not to update objects in-place, and are willing to sacrifice speed/memory for it. We're trying to figure out how best to provide this functionality as well, in addition to the already existing update by reference.
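In the meantime, one explicit way to leave the original untouched is to take a full copy with data.table's copy() before updating. This is just a sketch of that workaround on a made-up table, not the functionality being planned above.

```r
library(data.table)

DT <- data.table(x = 1:3, y = 4:6)

# copy() makes a deep copy, so := below modifies the copy only.
DT_new <- copy(DT)[, y := 2L]

print(DT)      # original
print(DT_new)  # updated copy
```

The cost is exactly the trade-off described above: the whole table is duplicated in memory before the update.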
Hope this helps. This is already quite a lengthy answer. I'll leave any questions you might have left to others or for you to figure out (other than any obvious misconceptions in this answer).