Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

南旧 2020-11-21 07:46

I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to und

2 answers
  • 2020-11-21 08:01

    Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)) then it's the copy that gets modified by reference.

    DT <- data.table(a = c(1, 2), b = c(11, 12)) 
    newDT <- DT 
    
    .Internal(inspect(DT))
    # @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
    #   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
    # ATTRIB:  # ..snip..
    
    .Internal(inspect(newDT))   # precisely the same object at this point
    # @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
    #   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
    # ATTRIB:  # ..snip..
    
    tracemem(newDT)
    # [1] "<0x0000000003b7e2a0>"
    
    newDT$b[2] <- 200
    # tracemem[0000000003B7E2A0 -> 00000000040ED948]: 
    # tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<- 
    
    .Internal(inspect(DT))
    # @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
    #   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
    # ATTRIB:  # ..snip..
    
    .Internal(inspect(newDT))
    # @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
    #   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
    # ATTRIB:  # ..snip..
    

    Notice how even the a vector was copied (different hex value indicates new copy of vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and why := and set() were introduced to data.table.
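
The same check can be made without reading raw inspect() output. A minimal sketch, assuming the data.table package is installed and using its address() function, which returns an object's memory address as a string (the variable names same_before, same_after and b_shared are illustrative only):

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT                      # plain binding: both names point at one object

same_before <- address(newDT) == address(DT)

newDT$b[2] <- 200                # base-style subassignment triggers the copy

same_after <- address(newDT) == address(DT)
b_shared   <- address(newDT$b) == address(DT$b)

print(c(same_before = same_before, same_after = same_after, b_shared = b_shared))
print(DT$b)                      # the original DT is untouched
```

Comparing addresses before and after is usually enough to tell whether an operation copied; inspect() is only needed when you want the per-column detail.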

    Now, with our copied newDT we can modify it by reference :

    newDT
    #      a   b
    # [1,] 1  11
    # [2,] 2 200
    
    newDT[2, b := 400]
    #      a   b        # See FAQ 2.21 for why this prints newDT
    # [1,] 1  11
    # [2,] 2 400
    
    .Internal(inspect(newDT))
    # @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
    #   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
    # ATTRIB:  # ..snip ..
    

    Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference with no copies at all.
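
A compact way to confirm this, sketched with data.table's address() on a freshly built table (the names addr_dt, addr_a, addr_b and unchanged are illustrative, not part of the original answer):

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

# Record the addresses of the table and of each column before the update.
addr_dt <- address(DT)
addr_a  <- address(DT$a)
addr_b  <- address(DT$b)

DT[2, b := 600]                  # update one element of an existing column

unchanged <- c(
  table = address(DT)   == addr_dt,
  a     = address(DT$a) == addr_a,
  b     = address(DT$b) == addr_b
)
print(unchanged)
```

All three comparisons are expected to stay TRUE here because the new value has the same type as the column, so nothing needs to be reallocated.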

    Or, we can modify the original DT by reference :

    DT[2, b := 600]
    #      a   b
    # [1,] 1  11
    # [2,] 2 600
    
    .Internal(inspect(DT))
    # @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
    #   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
    #   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
    # ATTRIB:  # ..snip..
    

    Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and comparison to data.frame.

    Btw, if you tracemem(DT) then DT[2,b:=600] you'll see one copy reported. That copy is made by the print method, which copies the first 10 rows in order to print them. When wrapped with invisible() or when called within a function or script, the print method isn't called.

    All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But remember that data.table is for large data (as well as offering faster programming for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual rule of thumb of a 3x working-memory factor. We try to need working memory only as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
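
To make the function case concrete, here is a small sketch contrasting the two behaviours. The helper names add_flag_in_place and add_flag_locally and the column name flag are hypothetical, chosen only for this illustration:

```r
library(data.table)

# Hypothetical helpers, named for illustration only.
add_flag_in_place <- function(x) {
  x[, flag := TRUE]        # modifies the caller's data.table by reference
  invisible(x)
}

add_flag_locally <- function(x) {
  x <- copy(x)             # explicit deep copy; the caller's object is left alone
  x[, flag := TRUE]
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))

res <- add_flag_locally(DT)
flag_in_DT_after_local <- "flag" %in% names(DT)      # DT was not touched
flag_in_res            <- "flag" %in% names(res)     # only the copy got the column

add_flag_in_place(DT)
flag_in_DT_after_in_place <- "flag" %in% names(DT)   # now DT itself has it
```

The checks use "flag" %in% names(DT) rather than storing names(DT), because a stored names vector can itself be updated by a later := on the same table.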

  • 2020-11-21 08:14

    Just a quick sum up.

    <- with data.table is just like base; i.e., no copy is taken until a subassign is done afterwards with <- (such as changing the column names or changing an element such as DT[i,j]<-v). Then it takes a copy of the whole object just like base. That's known as copy-on-write. Would be better known as copy-on-subassign, I think! It DOES NOT copy when you use the special := operator, or the set* functions provided by data.table. If you have large data you probably want to use them instead. := and set* will NOT COPY the data.table, EVEN WITHIN FUNCTIONS.

    Given this example data :

    DT <- data.table(a=c(1,2), b=c(11,12))
    

    The following just "binds" another name DT2 to the same data object currently bound to the name DT :

    DT2 <- DT
    

    This never copies, and never copies in base either. It just marks the data object so that R knows that two different names (DT2 and DT) point to the same object. And so R will need to copy the object if either is subassigned to afterwards.
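
One way to see that the two names share a single object, sketched with data.table's address() (the variable names shared and new_visible_in_DT are illustrative):

```r
library(data.table)

DT  <- data.table(a = c(1, 2), b = c(11, 12))
DT2 <- DT                          # a second name for the same object, no copy

shared <- address(DT) == address(DT2)

DT2[, new := 1L]                   # := through one name ...
new_visible_in_DT <- "new" %in% names(DT)   # ... shows up through the other

print(c(shared = shared, new_visible_in_DT = new_visible_in_DT))
```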

    That's perfect for data.table, too. The := isn't for doing that. So the following is a deliberate error as := isn't for just binding object names :

    DT2 := DT    # not what := is for, not defined, gives a nice error
    

    := is for subassigning by reference. But you don't use it like you would in base :

    DT[3,"foo"] := newvalue    # not like this
    

    you use it like this :

    DT[3,foo:=newvalue]    # like this
    

    That changed DT by reference. Say you add a new column new by reference to the data object, there is no need to do this :

    DT <- DT[,new:=1L]
    

    because the RHS already changed DT by reference. The extra DT <- is to misunderstand what := does. You can write it there, but it's superfluous.

    DT is changed by reference, by :=, EVEN WITHIN FUNCTIONS :

    f <- function(X){
        X[,new2:=2L]
        return("something else")
    }
    f(DT)   # will change DT
    
    DT2 <- DT
    f(DT)   # will change both DT and DT2 (they're the same data object)
    

    data.table is for large datasets, remember. If you have a 20GB data.table in memory then you need a way to do this. It's a very deliberate design decision of data.table.

    Copies can be made, of course. You just need to tell data.table that you're sure you want to copy your 20GB dataset, by using the copy() function :

    DT3 <- copy(DT)   # rather than DT3 <- DT
    DT3[,new3:=3L]     # now, this just changes DT3 because it's a copy, not DT too.
    

    To avoid copies, don't use base-style assignment or update :

    DT$new4 <- 1L                 # will make a copy so use :=
    attr(DT,"sorted") <- "a"      # will make a copy, so use setattr()
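
For comparison, a sketch of the by-reference counterparts, assuming a recent data.table. The attribute name "note" and the column name flag are made up for the example; set(), setnames() and setattr() are the data.table functions referred to above as set*:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr <- address(DT)

DT[, new4 := 1L]                        # instead of DT$new4 <- 1L
set(DT, i = 2L, j = "b", value = 99)    # instead of DT$b[2] <- 99
setnames(DT, "new4", "flag")            # instead of names(DT)[3] <- "flag"
setattr(DT, "note", "demo")             # instead of attr(DT, "note") <- "demo"

same_object <- address(DT) == addr      # still the object we started with
print(same_object)
print(DT)
```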
    

    If you want to be sure that you are updating by reference use .Internal(inspect(x)) and look at the memory address values of the constituents (see Matthew Dowle's answer).

    Writing := in j like that allows you to subassign by reference by group. You can add a new column by reference by group. So that's why := is done that way inside [...] :

    DT[, newcol:=mean(x), by=group]
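
A self-contained version of that call, using made-up data (the columns group and x and their values are assumptions for illustration):

```r
library(data.table)

# Hypothetical data: x is a numeric value, group is the grouping column.
DT <- data.table(group = c("g1", "g1", "g2", "g2"), x = c(1, 3, 10, 20))

# Compute one mean per group and write it to every row of that group,
# adding the column to DT by reference.
DT[, newcol := mean(x), by = group]

print(DT)
```

Every row receives its own group's mean, and DT keeps all four rows rather than being collapsed to one row per group.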
    