Understanding optimisation messages on assignment by reference in a data.table

挽巷 2020-12-31 22:39

This comes from an observation I made while answering this question from @sds.

First, let me switch on the trace messages for data.table:
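
A minimal sketch of switching the trace on, assuming the global `datatable.verbose` option is the switch meant here:

```r
# Assumption: `datatable.verbose` is the option that controls the trace
# messages discussed in this thread.
options(datatable.verbose = TRUE)

# Record the setting so it can be checked, and remember how to undo it.
verbose_on <- isTRUE(getOption("datatable.verbose"))
# options(datatable.verbose = FALSE)  # switches the trace off again
```

With the option set, subsequent data.table calls print extra diagnostics about what happens internally, including the messages discussed in the answer.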

1 Answer
  • 2020-12-31 22:47

    Update: The expression

    DT[, c(..., lapply(.SD, .), ...), by=.]
    

    has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

    o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).

    For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
    is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]

    But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet. This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.


    Where the message says NAMED vector, it means NAMED in the internal R sense at C level, i.e. whether an object has been bound to a symbol, not whether an atomic vector has a "names" attribute. The NAMED value in the SEXP structure takes the value 0, 1 or 2, and R uses it to decide whether it needs to copy on subassignment. See section 1.1.2 of R-ints.
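
To make the copy-on-subassign point concrete, here is a small base R sketch. It only shows the visible behaviour, not the NAMED field itself, and the variable names are made up for illustration:

```r
x <- c(1, 2, 3)   # a vector bound to one symbol
y <- x            # a second symbol now refers to the same vector
y[1] <- 100       # subassignment: R has to copy first, because x shares the data

# x must be untouched, otherwise modifying y would have leaked into x.
x_unchanged <- identical(x, c(1, 2, 3))
y_changed   <- identical(y, c(100, 2, 3))
```

This is the bookkeeping that data.table consults when deciding whether the right-hand side of `:=` can be plonked in directly or has to be duplicated first.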

    What would be better is if the optimisation of j in data.table could handle:

    DT[, c(lapply(.SD,sum),.N), by=a]
    

    That works but may be slow. Currently only the simpler form is optimised:

    DT[, lapply(.SD,sum), by=a]
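
As a rough illustration of the two forms side by side, here is a sketch on a toy table. The data and column names are hypothetical, and it checks only that both forms agree on the sums, not how fast either one runs:

```r
library(data.table)

# Hypothetical toy data, made up for illustration.
DT <- data.table(a = c(1, 1, 2), x = c(1, 2, 3), y = c(10, 20, 30))

# Simpler form: per-group sums only.
simple <- DT[, lapply(.SD, sum), by = a]

# Combined form: per-group sums plus the group size.
combined <- DT[, c(lapply(.SD, sum), .N), by = a]

# The group size ends up in the last column; its name is not relied on here.
group_sizes <- combined[[ncol(combined)]]
```

Both calls return one row per value of `a`; the combined form just carries the extra count column.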
    

    To answer the main question: yes, the following message

    Direct plonk of unnamed RHS, no copy.
    

    is preferable to:

    RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
    

    Another way to achieve this is:

    dt.out[, count := dt[, .N, by=a]$N]
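
For anyone who wants to try this, a self-contained sketch follows. The tables `dt` and `dt.out` here are hypothetical stand-ins built for illustration, not the ones from the original question:

```r
library(data.table)

# Hypothetical stand-ins for the tables in the question.
dt     <- data.table(a = c(1, 1, 2), b = c(1, 2, 3))
dt.out <- dt[, lapply(.SD, sum), by = a]

# Add the per-group count by reference, extracting the column with $N.
dt.out[, count := dt[, .N, by = a]$N]

# Both extraction styles give the same values; they may differ only in
# whether the right-hand side gets duplicated on assignment.
via_dollar  <- dt[, .N, by = a]$N
via_bracket <- dt[, .N, by = a][["N"]]
```

Running the `:=` line with the trace messages switched on should show which of the two messages above applies on a given data.table version.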
    

    I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
