I have a large data.table that I am collapsing to the month level using by.
There are 5 by vars, with # of levels: c(4,3,106,3,1380)
Make a Cartesian join of the unique values, and use that to join back to your results:
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
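As a minimal sketch of that last step, the NA counts can be replaced by reference after the join. The toy values below are made up for illustration; only the column names g1, g2, g3 and nv follow the code above.

```r
library(data.table)

# Toy data (hypothetical values); three observed combinations out of 2 x 2 x 2 = 8.
dat <- data.table(g1 = c("a", "a", "b"),
                  g2 = c(1L, 2L, 1L),
                  g3 = c("x", "x", "y"))
datCollapsed <- dat[, list(nv = .N), by = list(g1, g2, g3)]

# Cross join of the unique values, then join the collapsed results onto it.
dat.keys <- dat[, CJ(g1 = unique(g1), g2 = unique(g2), g3 = unique(g3))]
setkey(datCollapsed, g1, g2, g3)
full <- datCollapsed[dat.keys]

# Combinations that never occurred have nv = NA; set them to 0 in place.
full[is.na(nv), nv := 0L]
```

Using := updates the column in place, which avoids copying a large table just to fill in the zeros.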
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, lapply(dat[, ..keycols], unique)) ## CJ() to form index; lapply() always returns a list
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI] ## Aggregate per row of ii (by=.EACHI is needed in data.table >= 1.9.4)
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
#   0   1   2   3   4   5   6
# 135 191 162  82  39  13   3
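This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument. In data.table 1.9.4 and later, this grouping by each row of i has to be requested explicitly with by=.EACHI. The documentation describes it as follows: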
Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in i. When i is a data.table, DT[i,j] evaluates j for each row of i. We call this by without by or grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is identical to DT[,j,by=colA].
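The difference between grouping by i and grouping through by can be sketched on a small table. This assumes data.table 1.9.4 or later, where evaluating j once per row of i is requested with by=.EACHI; the values are made up for illustration.

```r
library(data.table)

# Toy data (hypothetical values); only (a, 1) and (b, 2) occur.
dat <- data.table(g1 = c("a", "a", "b"), g2 = c(1L, 1L, 2L))
setkey(dat, g1, g2)

# Every combination of the observed levels (2 x 2 = 4 rows).
ii <- CJ(g1 = unique(dat$g1), g2 = unique(dat$g2))

# j is evaluated once per row of ii; combinations without a match get .N = 0.
counts <- dat[ii, list(nv = .N), by = .EACHI]

# Grouping through by returns only the combinations present in the data.
observed <- dat[, list(nv = .N), by = list(g1, g2)]
```

Here counts has one row per combination in ii, including the empty ones, while observed has only the two combinations that actually appear in dat.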