Warning: 'Invalid .internal.selfref detected' when adding a column to a data.table returned from a function

南方客 2020-12-09 11:29

This looks like an fread bug, but I am not sure.

This example reproduces my problem. I have a function where I read a data.table and return it in a list. i us

2 Answers
  • 2020-12-09 12:17

    Arun's answer is a great explanation. The specific feature of list() in R <= 3.0.2 is that it copies named inputs (things that have been named before the call to list()). In r-devel now (the next version of R), this copy by list() no longer happens and all will be well. It's a very welcome change in R.

    In the meantime, you can work around it by creating the output list in a different way.

    > R.version.string
    [1] "R version 3.0.2 (2013-09-25)"
    

    First, demonstrate that list() copies:

    > DT = data.table(a=1:3)
    > address(DT)
    [1] "0x1d70010"
    > address(list(DT)[[1]])
    [1] "0x21bc178"    # different address => list() copied the data.table named DT
    > data.table:::selfrefok(DT)
    [1] 1
    > data.table:::selfrefok(list(DT)[[1]])
    [1] 0              # i.e. this copied DT is not over-allocated
    

    Now a different way to create the same list:

    > ans = list()
    > ans$DT = DT    # use $<- instead
    > address(DT)
    [1] "0x1d70010"
    > address(ans$DT)
    [1] "0x1d70010"    # good, no copy
    > identical(ans, list(DT=DT))
    [1] TRUE
    > data.table:::selfrefok(ans$DT)
    [1] 1              # good, the list()-ed DT is still over-allocated ok
    

    Convoluted and confusing, I know. Using $<- to create the output list, or even just placing the call to fread inside the call to list(), i.e. list(DT=fread(...)), should avoid the copy by list().
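
    As a rough sketch of that workaround inside a function: the name `read_tables` and the inline `text =` input are made up for illustration, and a real call would pass a file path to fread instead.

    ```r
    library(data.table)

    # Hypothetical reader function: returns a list holding a data.table.
    read_tables <- function() {
      ans <- list()
      # Assign with $<- (or call fread() directly inside list()) rather than
      # naming the table first and then wrapping it with list(DT = DT).
      ans$DT <- fread(text = "a,b\n1,2\n3,4")
      ans
    }

    res <- read_tables()
    dt <- res$DT
    dt[, total := a + b]   # add a column by reference
    print(dt)
    ```

    On R >= 3.1.0 the plain list(DT = DT) form should behave the same way, so this matters mainly for older R versions.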

  • 2020-12-09 12:22

    This has nothing to do with fread per se. The cause is that you're calling list() and passing it a named object. We can recreate this by doing:

    require(data.table)
    DT <- data.table(x=1:2)       # name the object 'DT'
    DT.l <- list(DT=DT)           # create a list containing one data.table
    y <- DT.l$DT                  # get back the data.table
    y[, bla := 1L]                # now add by reference
    # works fine but warning message will occur
    
    DT.l = list(DT=data.table(x=1:2))   # DT = a call, not a named object
    y = DT.l$DT
    y[, bla:=1L]
    # works fine and no warning message
    

    Good news:

    The good news is that from R version >= 3.1.0 (now in devel), passing a named object to list() will no longer create a copy. Instead, its reference count (the number of objects pointing to this value) just gets bumped. So, the problem goes away with the next version of R.

    To understand how data.table detects copies using .internal.selfref, we'll dive into some history of data.table.

    First, some history:

    You should know that data.table over-allocates column pointer slots on creation (truelength is set to a default of 100 at the time of writing) so that := can be used to add columns by reference later on. There was one issue with this: handling copies. For example, when we call list() and pass it a named object, a copy is made, as illustrated below.

    tracemem(DT)
    # [1] "<0x7fe23ac3e6d0>"
    DT.list <- list(DT=DT)    # `DT` is the named object on the RHS of = here
    # tracemem[0x7fe23ac3e6d0 -> 0x7fe23cd72f48]: 
    

    The problem with any copy of a data.table that R makes (as opposed to data.table's own copy()) is that R internally sets the truelength parameter to 0, even though the truelength(.) function will still return what looks like the correct result. This inadvertently led to a segfault when the copy was updated by reference with :=, because the over-allocation didn't exist anymore (or at least was not recognised anymore). This happened in versions < 1.7.8. To overcome this, an attribute called .internal.selfref was introduced. You can check this attribute by doing attributes(DT).
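
    The over-allocation itself is easy to inspect. A minimal sketch using only the exported helpers truelength() and address(); the exact number of spare slots depends on the data.table version and options, so no figure is shown here:

    ```r
    library(data.table)

    DT <- data.table(x = 1:2)
    length(DT)       # number of columns in use
    truelength(DT)   # number of allocated column slots, larger than length(DT)

    before <- address(DT)
    DT[, y := 3:4]   # fills a spare slot instead of allocating a new table
    after <- address(DT)
    identical(before, after)
    ```

    Because the new column goes into a spare slot, DT stays at the same address, which is what "by reference" means here.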

    From NEWS (of v1.7.8):

    o The 'Chris crash' is fixed. The root cause was that key<- always copies the whole table. The problem with that copy (other than being slower) is that R doesn't maintain the over allocated truelength, but it looks as though it has. key<- was used internally, in particular in merge(). So, adding a column using := after merge() was a memory overwrite, since the over allocated memory wasn't really there after key<-'s copy.

    data.tables now have a new attribute .internal.selfref to catch and warn about such copies in future. All internal use of key<- has been replaced with setkey(), or new function setkeyv() which accepts a vector, and do not copy.

    What does this .internal.selfref do?

    It just points to itself, basically. It's simply an attribute attached to DT that contains the address in RAM of DT. If R inadvertently copies DT, the address of DT will move in RAM but the attached attribute will still contain the old memory address, so the two won't match any more. data.table checks that they do match (i.e. that the selfref is valid) before adding a new column by reference into a spare column pointer slot.
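
    A small sketch of that mismatch on a current R version. It assumes that attr<- applied to a data.table bound to two names makes R copy it (one of the cases data.table's own warning text mentions), and it uses the internal helper data.table:::selfrefok seen earlier in this thread, which is not part of the public API and may change:

    ```r
    library(data.table)

    DT <- data.table(x = 1:2)
    names(attributes(DT))         # includes ".internal.selfref"

    DT2 <- DT                     # same object, two names
    attr(DT2, "note") <- "hi"     # attr<- on a shared object makes R copy it
    address(DT) == address(DT2)   # the copy lives at a different address

    # The attribute travelled with the copy but still refers to the original.
    data.table:::selfrefok(DT)    # check on the original
    data.table:::selfrefok(DT2)   # check on R's copy
    ```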

    How is .internal.selfref implemented?

    To understand the .internal.selfref attribute, we have to understand what an external pointer (EXTPTRSXP) is. This page explains it nicely. Copy/pasting the essential lines:

    External pointer SEXPs are intended to handle references to C structures such as handles, and are used for this purpose in package RODBC for example. They are unusual in their copying semantics in that when an R object is copied, the external pointer object is not duplicated.

    They are created as:

    SEXP R_MakeExternalPtr(void *p, SEXP tag, SEXP prot);
    

    where p is the pointer (and hence this cannot portably be a function pointer), and tag and prot are references to ordinary R objects which will remain in existence (be protected from garbage collection) for the lifetime of the external pointer object. A useful convention is to use the tag field for some form of type identification and the prot field for protecting the memory that the external pointer represents, if that memory is allocated from the R heap.

    In our case, we create the attribute .internal.selfref on DT. Its value is an external pointer to NULL (the address of which you see in the attribute value), and this external pointer's prot field is another external pointer back to DT (hence the name selfref), with its own prot set to NULL this time.

    Note: We have to use this "extptr to NULL whose prot is an extptr" strategy so that identical(DT1, DT2) returns TRUE when DT1 and DT2 are two different copies with the same content. (If you don't understand what this means, you can just skip to the next part. It's not relevant to understanding the answer to this question.)

    Okay, how does this all work then?

    We know that the external pointer does not get duplicated during a copy. Basically, when we create a data.table, the attribute .internal.selfref creates an external pointer to NULL with its prot field creating an external pointer back to DT. Now, when an unintentional "copy" is made, the object's address changes but the address protected by the attribute does not. It still points to DT, whether DT exists or not, because it won't/can't be modified. The copy is therefore detected internally by comparing the address of the current object with the address protected by the external pointer. If they don't match, then a "copy" has been made by R (one that would have lost the over-allocation that data.table carefully created). That is:

    DT <- data.table(x=1:2) # internal selfref set
    DT.list <- list(DT=DT)  # copy made, address(DT.list$DT) != address(DT)
                            # and truelength would be affected.
    
    DT.new <- DT.list$DT    # address of DT.new != address of DT
                            # and it's not equal to the address pointed to by
                            # the attribute's 'prot' external pointer
    
    # so a re-over-allocation has to be made by data.table at the next update by
    # reference, and it warns so you can fix the root cause by not using list(),
    # key<-, names<- etc.
    

    That's a lot to take in. I think I've managed to explain it as clearly as possible. If there are any mistakes (it took me a while to wrap my head around this) or possibilities for further clarity, feel free to edit or comment with your suggestions.

    Hope this clears up things.
