Understanding exactly when a data.table is a reference to (vs a copy of) another data.table

南旧 2020-11-21 07:46

I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to 'break' the reference, and I'd like to und

2 Answers
  •  太阳男子
    2020-11-21 08:14

    Just a quick summary.

    <- with data.table behaves just as it does in base R; i.e., no copy is taken until a subassignment is made afterwards with <- (such as changing the column names, or changing an element with DT[i,j]<-v). At that point it takes a copy of the whole object, just like base. That's known as copy-on-write. It would be better known as copy-on-subassign, I think!

    It DOES NOT copy when you use the special := operator, or the set* functions provided by data.table. If you have large data you probably want to use them instead. := and set* will NOT COPY the data.table, EVEN WITHIN FUNCTIONS.

    Given this example data :

    DT <- data.table(a=c(1,2), b=c(11,12))
    

    The following just "binds" another name, DT2, to the same data object currently bound to the name DT :

    DT2 <- DT
    

    This never copies in data.table, and never copies in base either. It just marks the data object so that R knows that two different names (DT2 and DT) point to the same object. R will therefore need to copy the object if either name is subassigned to afterwards.
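
    As a minimal sketch of that copy-on-subassign behaviour (assuming the data.table package is installed; the column name c is made up for illustration), a base-style subassignment through DT2 leaves DT untouched, which shows that a copy was taken at that point:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
DT2 <- DT          # binds a second name to the same object; no copy yet

# Base-style subassignment through DT2: the object is copied first,
# so only DT2 gains the (hypothetical) column "c".
DT2$c <- 1L

dt_has_c  <- "c" %in% names(DT)    # DT was not modified
dt2_has_c <- "c" %in% names(DT2)   # DT2 now refers to the modified copy

print(c(dt_has_c = dt_has_c, dt2_has_c = dt2_has_c))
```

    Had the column been added with := instead, both names would show it, because no copy would have been made.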

    That's perfect for data.table, too. The := isn't for doing that. So the following is a deliberate error as := isn't for just binding object names :

    DT2 := DT    # not what := is for, not defined, gives a nice error
    

    := is for subassigning by reference. But you don't use it like you would in base :

    DT[3,"foo"] := newvalue    # not like this
    

    you use it like this :

    DT[3,foo:=newvalue]    # like this
    

    That changed DT by reference. Say you add a new column new by reference to the data object, there is no need to do this :

    DT <- DT[,new:=1L]
    

    because the RHS has already changed DT by reference. Adding the extra DT <- misunderstands what := does. You can write it there, but it's superfluous.

    DT is changed by reference, by :=, EVEN WITHIN FUNCTIONS :

    f <- function(X){
        X[,new2:=2L]
        return("something else")
    }
    f(DT)   # will change DT
    
    DT2 <- DT
    f(DT)   # will change both DT and DT2 (they're the same data object)
    

    data.table is for large datasets, remember. If you have a 20GB data.table in memory then you need a way to do this. It's a very deliberate design decision of data.table.

    Copies can be made, of course. You just need to tell data.table that you're sure you want to copy your 20GB dataset, by using the copy() function :

    DT3 <- copy(DT)   # rather than DT3 <- DT
    DT3[,new3:=3L]     # now, this just changes DT3 because it's a copy, not DT too.
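
    The same idea applies inside functions. A minimal sketch, assuming you want a function that uses := internally without touching the caller's table (add_flag_copy and the column flag are hypothetical names used only for illustration):

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

# Hypothetical helper: takes an explicit copy first, so := only
# modifies the private copy and never the caller's data.table.
add_flag_copy <- function(X) {
  X <- copy(X)
  X[, flag := TRUE]
  X
}

res <- add_flag_copy(DT)

print(names(DT))    # the caller's table keeps its original columns
print(names(res))   # the returned copy carries the extra column
```

    The price is a full copy of the data on every call, which is exactly what := is designed to avoid for large tables, so this is a trade-off rather than a default.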
    

    To avoid copies, don't use base-style assignment or update :

    DT$new4 <- 1L                 # makes a copy; use := instead
    attr(DT,"sorted") <- "a"      # makes a copy; use setattr() instead
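
    A sketch of the by-reference counterparts, using :=, set(), setnames() and setattr() from data.table. The attribute name "note" and the column names are made up for illustration, and the address check at the end is just one way to confirm that DT was not replaced by a copy:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr_before <- address(DT)

DT[, new4 := 1L]                       # instead of DT$new4 <- 1L
set(DT, i = 1L, j = "b", value = 99)   # instead of DT[1, "b"] <- 99
setnames(DT, "new4", "flag")           # instead of names(DT)[3] <- "flag"
setattr(DT, "note", "demo")            # instead of attr(DT, "note") <- "demo"

# The name DT still refers to the same object after all four updates.
same_object <- identical(address(DT), addr_before)
print(same_object)
```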
    

    If you want to be sure that you are updating by reference use .Internal(inspect(x)) and look at the memory address values of the constituents (see Matthew Dowle's answer).
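
    A lighter-weight check than reading the inspect() output is data.table's own address() function, which returns the memory address of an object as a string. The following is only a sketch (the column names z and y are hypothetical): the address stays the same after :=, and changes after a base-style $<- assignment.

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))

addr_start <- address(DT)

DT[, z := 0L]                  # update by reference
addr_after_ref <- address(DT)

DT$y <- 1L                     # base-style update; DT is rebound to a copy
addr_after_base <- address(DT)

print(identical(addr_start, addr_after_ref))   # same object after :=
print(identical(addr_start, addr_after_base))  # different object after $<-
```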

    Writing := in j like that allows you to subassign by reference by group. You can add a new column by reference by group. That's why := is done that way, inside [...] :

    DT[, newcol:=mean(x), by=group]
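
    A runnable version of that line, using a small made-up table (the values, and the columns group and x, are illustrative stand-ins for the names used above):

```r
library(data.table)

# Hypothetical data with a grouping column and a numeric column.
DT <- data.table(group = c("g1", "g1", "g2"), x = c(1, 3, 10))

# Adds newcol in place: each row receives the mean of x within its group.
DT[, newcol := mean(x), by = group]

print(DT)
```

    Every row keeps its original position; only the new column is added, holding 2 for both g1 rows and 10 for the g2 row.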
    
