Why does selecting column(s) from a data.table result in a copy?

渐次进展 2021-01-04 08:41

It appears that selecting column(s) from the data.table with [.data.table results in a copy of the underlying vector(s). I am talking about very simple column s

1 answer
  • 2021-01-04 09:27

    It's been a while since I thought about this, but here goes.

    Good question. But why do you need to subset a data.table like that? We really need to see what you are doing next: the bigger picture. For that bigger picture, data.table probably has a different way of doing it than the base R idiom.

    To illustrate roughly, with what is probably a bad example:

    DT[region=="EU", lapply(.SD, sum), .SDcols=10:20]
    

    rather than the base R idiom of taking a subset and then doing the next step (here, apply) on the result outside:

    apply(DT[DT$region=="EU", 10:20], 2, sum)
    

    In general, we want to encourage doing as much as possible inside one [...] so that data.table sees the i, j and by together in a single operation and can optimize the combination. When you subset columns and then do the next thing outside afterwards, optimizing that requires more software complexity. In most cases, most of the computational cost is inside the first [...], which reduces the data to a relatively insignificant size.
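
A minimal sketch of the two idioms side by side, assuming a small made-up table (the column names `a`, `b`, `c` and the values are hypothetical, chosen only for illustration). Both forms should give the same sums; the difference is that the first hands the filter, the column choice and the aggregation to data.table in one call.

```r
library(data.table)

# Hypothetical toy table: one grouping column plus three numeric columns
DT <- data.table(
  region = c("EU", "US", "EU", "ASIA"),
  a = 1:4,
  b = c(10, 20, 30, 40),
  c = c(0.5, 1.5, 2.5, 3.5)
)

# data.table idiom: i (filter), j (aggregation) and .SDcols in a single [...]
inside <- DT[region == "EU", lapply(.SD, sum), .SDcols = c("a", "b", "c")]

# base R idiom: take the subset first, then aggregate on the result outside
outside <- apply(DT[DT$region == "EU", c("a", "b", "c")], 2, sum)

print(inside)
print(outside)
```

The first form returns a one-row data.table and the second a named numeric vector, so `unlist(inside)` is one way to compare them.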

    With that said, in addition to Frank's comment about shallow, we're also waiting to see how the ALTREP project pans out. That improves reference counting in base R and may enable := to know reliably whether a column it is operating on needs to be copied first or not. Currently, := always updates by reference, so it would update both data.tables if selecting some whole columns did not take a deep copy (the copy is deliberate, for that reason). If := is not used inside [...], then [...] always returns a new result which is safe to use := on. That is quite a straightforward rule at the moment, and it holds even if all you're doing is selecting a few whole columns for some reason.
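
The rule described above can be sketched as follows, assuming a tiny made-up table (the names `alias` and `sub` are hypothetical). Plain assignment leaves both names pointing at the same data.table, so := through one is visible through the other, whereas a column selection with [...] returns a new table that := can modify without touching the original.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c(2.5, 3.5, 4.5))

# Plain assignment does not copy: both names refer to the same data.table
alias <- DT
alias[, x := x * 10L]
print(DT$x)   # the update made through `alias` shows up in DT as well

# Selecting whole columns with [...] returns a new data.table
sub <- DT[, .(y)]

# address() reports where a vector lives in memory; per the answer above,
# the selected column is expected to be a separate copy
same_vector <- address(DT$y) == address(sub$y)

# := on the selection is therefore safe and leaves DT alone
sub[, y := 0]
print(DT$y)
print(sub$y)
```

When a full copy of a table is wanted explicitly, `copy(DT)` is the exported data.table function for that.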

    We really need to see the bigger picture please: what you're doing afterwards on the subset of columns. Having that clear would help to raise the priority in either investigating ALTREP or perhaps doing our own reference count for this case.
