Understanding optimisation messages on assignment by reference in a data.table

挽巷

This comes from something I noticed while answering this question from @sds here.

First, let me switch on the trace messages for data.table:
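A minimal sketch of how this is usually done, assuming the standard `datatable.verbose` option is what is meant here. With it set, data.table prints diagnostic messages, including the ones about assignment by reference discussed below.

```r
# Turn on data.table's verbose (trace) messages for this session.
# Setting the option does not itself require data.table to be loaded.
options(datatable.verbose = TRUE)
```

Setting the option back to `FALSE` turns the messages off again.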

1 answer

臣服心动

2020-12-31 22:47
Update: The expression,
```
DT[, c(..., lapply(.SD, .), ...), by=.]
```
has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

o Complex j-expressions of the form DT[, c(..., lapply(.SD, fun)), by=grp] are now optimised, as long as .SD is only present in the form lapply(.SD, fun).

For ex: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]

But DT[, c(.SD, lapply(.SD, sum)), by=grp] for example isn't optimised yet. This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.

Where the message says "NAMED vector", it means NAMED in the internal R sense at C level; i.e., whether an object has been assigned to a symbol, not whether an atomic vector has a "names" attribute. The NAMED field in the SEXP structure takes the value 0, 1 or 2, and R uses it to decide whether it needs to copy on subassignment. See section 1.1.2 of R-ints.
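The user-visible effect of that bookkeeping can be sketched in base R. The snippet below is only an illustration under stated assumptions: the vectors `x` and `y` are made up, and `tracemem()` reports duplications only when R was built with memory profiling, hence the `capabilities("profmem")` guard.

```r
# A vector bound to two symbols must be duplicated before one of them is modified.
x <- c(1, 2, 3)
y <- x                                      # x and y now share one underlying vector

if (capabilities("profmem")) tracemem(y)    # ask R to report duplications of y
y[1] <- 10                                  # R copies first, so x is left untouched
if (capabilities("profmem")) untracemem(y)  # stop tracing

print(x)
print(y)
```

A value that has never been bound to a symbol, such as the result of an expression used directly on the right-hand side, carries no such obligation, which is why it can be used without a copy.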

What would be better is if optimization of j in data.table could handle:
```
DT[, c(lapply(.SD,sum),.N), by=a]
```
That works but may be slow. When this was written, only the simpler form was optimized:
```
DT[, lapply(.SD,sum), by=a]
```
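Both forms give the same per-group sums; they differ only in whether the group size comes along. A small sketch to compare them, where the table `DT` and its columns `a` and `b` are made-up data for illustration. Whether the combined form is optimised depends on the data.table version installed.

```r
library(data.table)

# Hypothetical toy table: two groups of five rows each.
DT <- data.table(a = rep(c(3, 4), each = 5), b = 1:10)

# Simple form: per-group sum of every non-grouping column.
simple <- DT[, lapply(.SD, sum), by = a]

# Combined form: the same sums plus the group size .N.
combined <- DT[, c(lapply(.SD, sum), .N), by = a]

print(simple)
print(combined)
```

Running this with `options(datatable.verbose = TRUE)` should show how each j expression is treated in your version.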
To answer the main question: yes, the following message:
```
Direct plonk of unnamed RHS, no copy.
```
is desirable compared to:
```
RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.
```
Another way to achieve this is:
```
dt.out[, count := dt[, .N, by=a]$N]
```
I'm not quite sure why [["N"]] returns a NAM(2) object whereas $N doesn't.
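For completeness, here is a self-contained sketch of that approach. The contents of `dt` and the column `b` are assumptions made for illustration; only `dt`, `dt.out`, `a` and `count` appear in the snippet above. The messages printed can differ between data.table and R versions, so no particular output is assumed.

```r
library(data.table)

# Hypothetical input: two groups of five rows each.
dt <- data.table(a = rep(c(3, 4), each = 5), b = 1:10)

# Per-group sums, one row per value of a.
dt.out <- dt[, lapply(.SD, sum), by = a]

# Show data.table's messages for the assignment by reference only.
options(datatable.verbose = TRUE)
dt.out[, count := dt[, .N, by = a]$N]
options(datatable.verbose = FALSE)

print(dt.out)
```

This relies on both grouped results listing the groups in the same order, which holds here because both are grouped by `a` on the same table.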