This looks like an fread bug, but I am not sure.
This example reproduces my problem. I have a function where I read a data.table and return it in a list. i us
Arun's answer is a great explanation. The specific feature of list() in R <= 3.0.2 is that it copies named inputs (things that have been named before the call to list()). In r-devel now (the next version of R), this copy by list() no longer happens and all will be well. It's a very welcome change in R.
In the meantime, you can work around it by creating the output list in a different way.
> R.version.string
[1] "R version 3.0.2 (2013-09-25)"
First, demonstrate list() copying:
> DT = data.table(a=1:3)
> address(DT)
[1] "0x1d70010"
> address(list(DT)[[1]])
[1] "0x21bc178" # different address => list() copied the data.table named DT
> data.table:::selfrefok(DT)
[1] 1
> data.table:::selfrefok(list(DT)[[1]])
[1] 0 # i.e. this copied DT is not over-allocated
Now a different way to create the same list:
> ans = list()
> ans$DT = DT # use $<- instead
> address(DT)
[1] "0x1d70010"
> address(ans$DT)
[1] "0x1d70010" # good, no copy
> identical(ans, list(DT=DT))
[1] TRUE
> data.table:::selfrefok(ans$DT)
[1] 1 # good, the list()-ed DT is still over-allocated ok
Convoluted and confusing, I know. Using $<- to create the output list, or even just placing the call to fread inside the call to list(), i.e. list(DT=fread(...)), should avoid the copy by list().
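As a minimal sketch of that workaround applied to a function that reads a file and returns the table in a list: the helper name read_into_list and the temporary CSV are assumptions made up for illustration, not part of the original question.

```r
library(data.table)

# Hypothetical helper: reads a file and returns the table inside a list.
read_into_list <- function(path) {
  ans <- list()
  ans$DT <- fread(path)   # assign with $<- rather than list(DT = DT)
  ans
}

# Small temporary CSV so the sketch is self-contained.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:3), tmp, row.names = FALSE)

out <- read_into_list(tmp)
y <- out$DT
y[, bla := 1L]   # add a column by reference
names(y)
```

Writing the body as list(DT = fread(path)) should work equally well, since the argument is then a call rather than a named object.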
This has nothing to do with fread per se; the issue is that you're calling list() and passing it a named object. We can recreate this by doing:
require(data.table)
DT <- data.table(x=1:2) # name the object 'DT'
DT.l <- list(DT=DT) # create a list containing one data.table
y <- DT.l$DT # get back the data.table
y[, bla := 1L] # now add by reference
# works fine but warning message will occur
DT.l = list(DT=data.table(x=1:2)) # DT = a call, not a named object
y = DT.l$DT
y[, bla:=1L]
# works fine and no warning message
The good news is that from R version >= 3.1.0 (now in devel), passing a named object to list() will no longer create a copy; rather, its reference count (the number of objects pointing to this value) just gets bumped. So, the problem goes away with the next version of R.
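Whether list() copies therefore depends on the R version in use. A rough way to check on your own installation is to compare addresses with data.table's exported address() function; the variable name copied is just an illustration, and no particular result is assumed here.

```r
library(data.table)

DT <- data.table(x = 1:2)

# TRUE if list() placed a copy of DT in the list, FALSE if it holds the same object.
copied <- address(DT) != address(list(DT = DT)$DT)

R.version.string
copied
```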
To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.
You should know that data.table over-allocates column pointer slots (truelength is set to a default of 100) on creation so that := can be used to add columns by reference later on. There was one issue with this: handling copies. For example, when we call list() and pass it a named object, a copy is made, as illustrated below.
tracemem(DT)
# [1] "<0x7fe23ac3e6d0>"
DT.list <- list(DT=DT) # `DT` is the named object on the RHS of = here
# tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]:
The problem with any copy of a data.table that R makes (as opposed to data.table's own copy()) is that R internally sets the truelength parameter to 0, even though the truelength(.) function will still return the correct result. This inadvertently led to a segfault when the copy was updated by reference with :=, because the over-allocation didn't exist anymore (or at least was not recognised anymore). This happened in versions < 1.7.8. In order to overcome this, an attribute called .internal.selfref was introduced. You can check this attribute by doing attributes(DT).
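A small sketch of inspecting these pieces on a fresh table is given here. It only uses base R's attr() and typeof() plus data.table's exported truelength(); the number of over-allocated slots depends on your data.table version and options, so no specific value is assumed.

```r
library(data.table)

DT <- data.table(x = 1:2)

# The self-reference attribute is stored as an external pointer.
selfref <- attr(DT, ".internal.selfref")
typeof(selfref)

# Columns in use versus column pointer slots allocated.
length(DT)
truelength(DT)
```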
From NEWS (of v1.7.8):
o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.
data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.
What does .internal.selfref do?
It just points to itself, basically. It's simply an attribute attached to DT that contains the address in RAM of DT. If R inadvertently copies DT, the address of DT will move in RAM but the attribute attached will still contain the old memory address, so they won't match any more. data.table checks that they do match (i.e. that the self-reference is valid) before adding a new column by reference into a spare column pointer slot.
How is .internal.selfref implemented?
In order to understand this attribute, .internal.selfref, we have to understand what an external pointer (EXTPTRSXP) is. This page explains nicely. Copy/pasting the essential lines:
External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.
They are created as:
SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);
where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.
In our case, we create the attribute .internal.selfref of/for DT, whose value is an external pointer to NULL (the address of which you see in the attribute value), and this external pointer's prot field is another external pointer back to DT (hence referred to as selfref) with its prot set to NULL this time.
Note: We have to employ this "extptr to NULL whose prot is an extptr" strategy so that identical(DT1, DT2), where DT1 and DT2 are two different copies with the same content, returns TRUE. (If you don't understand what this means, you can just skip to the next part. It's not relevant to understanding the answer to this question.)
We know that the external pointer does not get duplicated during a copy. Basically, when we create a data.table, the attribute .internal.selfref is created as an external pointer to NULL, with its prot field holding an external pointer back to DT. Now, when an unintentional "copy" is made, the object's address changes but the address protected by the attribute does not. It still points to the original DT, whether it exists or not, because it won't/can't be modified. A copy is therefore detected internally by comparing the address of the current object with the address protected by the external pointer. If they don't match, then a "copy" has been made by R (that would have lost the over-allocation that data.table carefully created). That is:
DT <- data.table(x=1:2) # internal selfref set
DT.list <- list(DT=DT) # copy made, address(DT.list$DT) != address(DT)
# and truelength would be affected.
DT.new <- DT.list$DT # address of DT.new != address of DT
# and it's not equal to the address pointed to by
# the attribute's 'prot' external pointer
# so a re-over-allocation has to be made by data.table at the next update by
# reference, and it warns so you can fix the root cause by not using list(),
# key<-, names<- etc.
That's a lot to take in. I think I've managed to get it across as clearly as possible. If there are any mistakes (it took me a while to wrap my head around this) or possibilities for further clarity, feel free to edit or comment with your suggestions.
Hope this clears up things.