Understanding data.table invalid .selfref warning

谎友^ 2021-01-18 15:16

I am trying to figure out the data.table 'invalid .selfref' warning that I am getting with the code below.

library(data.table) 
library(dplyr)
DT <- dat         

1 answer
  • 2021-01-18 15:37

    I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can be added efficiently by reference later on), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.
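
    As a rough illustration of the over-allocation, the sketch below compares length() with data.table's truelength() on a freshly created table. The sample data is made up for illustration; only the column names aa, bb and dd are taken from the code further down. The exact number of spare slots depends on the datatable.alloccol option, so no specific value is assumed.

```r
library(data.table)

# Hypothetical sample data (not the data from the question).
DT <- data.table(aa = 1:3, bb = c(0.5, 1.5, 2.5), dd = c("x", "y", "x"))

n_cols  <- length(DT)      # column-pointer slots currently in use
n_slots <- truelength(DT)  # column-pointer slots actually allocated

# On a healthy data.table the allocation is larger than the number of columns.
print(c(used = n_cols, allocated = n_slots))
```

    The gap between the two numbers is the room that := can use without reallocating.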

    Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers and the arrow shows the data each pointer is pointing to.

    The figure on the left shows how adding (or cbinding) a column to a data.frame works. cbinding a column results in a (deep or shallow) copy, so the vector of column pointers ends up at a new location (shown in yellow), as does the data (which now has one more column).

    The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, because the vector is over-allocated when the data.table is created. When you use :=, not even a shallow copy is made. The existing vector of column pointers stays where it is, and the next unused over-allocated box is used for your new column.
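
    A small sketch of that contrast, assuming made-up sample data: cbind() on a data.frame yields a new object, while := on a freshly created data.table leaves both the object's address and its allocation unchanged. address() and truelength() are data.table helpers; the variable names are only illustrative.

```r
library(data.table)

# data.frame: cbind() returns a new object with its own pointer vector.
DF  <- data.frame(aa = 1:3, bb = c(0.5, 1.5, 2.5))
DF2 <- cbind(DF, ee = 3)
df_is_new_object <- address(DF) != address(DF2)

# data.table: := fills the next spare slot of the existing pointer vector.
DT <- data.table(aa = 1:3, bb = c(0.5, 1.5, 2.5))
addr_before  <- address(DT)
slots_before <- truelength(DT)

DT[, ee := 3]  # add a column by reference

addr_after  <- address(DT)
slots_after <- truelength(DT)
```

    If the over-allocation is intact, addr_before and addr_after should be identical and the allocation should not change, while length(DT) grows by one.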

    That is the difference between the two, and it is what over-allocation means here.

    Now, the warning tells you that whatever operation you did has removed this over-allocation, meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which is unnecessary and should be avoided, but since the over-allocation is already gone, we do the next best thing).

    My guess is that your dplyr syntax somehow removes this over-allocation. This is caught in the next step, when you use :=, and data.table over-allocates once again before adding the new column by reference (which results in a shallow copy).
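
    The sketch below shows how one might check for a lost over-allocation and restore it explicitly. Because the dplyr code in the question is not available here, a saveRDS()/readRDS() round trip is used as a stand-in for "an operation that drops the spare slots"; that substitution is an assumption made purely for illustration, and the sketch does not try to reproduce the warning itself. It relies on setDT() re-allocating when called on an existing data.table.

```r
library(data.table)

# Hypothetical sample data (not the data from the question).
DT <- data.table(aa = 1:3, bb = c(0.5, 1.5, 2.5))

# Stand-in for the operation that loses the over-allocation.
# This is NOT the dplyr call from the question.
tmp <- tempfile(fileext = ".rds")
saveRDS(DT, tmp)
DT2 <- readRDS(tmp)
unlink(tmp)

slots_lost <- truelength(DT2)      # spare slots are gone after the round trip

# setDT() on an existing data.table over-allocates again.
setDT(DT2)
slots_restored <- truelength(DT2)

DT2[, ee := 3]                     # adding by reference has room to work again
```

    Comparing truelength() with length() after a suspicious step is a quick way to narrow down which call dropped the spare slots.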

    If I do it the data.table way:

    DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
    DT[, ee := 3]
    

    it works just fine.
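
    For reference, a self-contained version of the same two steps, using made-up data because the data from the question is not shown. Only the column names aa, bb and dd come from the code above; the values are hypothetical.

```r
library(data.table)

# Hypothetical sample data with the column names used in the answer.
DT <- data.table(
  aa = c(1L, 1L, 2L, 2L),
  bb = c(10, 20, 30, 40),
  dd = c("x", "x", "y", "y")
)

# Aggregate with data.table syntax; the result is a regular data.table.
DT <- DT[, list(m = mean(bb)), by = list(dd, aa)]

# The aggregated result still has spare column-pointer slots ...
still_overallocated <- truelength(DT) > length(DT)

# ... so := can add the new column by reference.
DT[, ee := 3]
```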

    I don't have the time to look into dplyr right now to verify or find out what's doing this.

    Update: Have suggested necessary changes as a pull request here.
