Including all permutations when using data.table[,,by=…]

长情又很酷 2021-01-18 22:40

I have a large data.table that I am collapsing to the month level using the by argument.

There are 5 by variables, with the following numbers of levels: c(4, 3, 106, 3, 1380)
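
For scale, the size of the full grid implied by those level counts can be worked out directly. This is only a back-of-the-envelope sketch; the variable names are illustrative and not from the original post.

```r
# Number of levels of each of the 5 by variables, as given in the question
n_levels <- c(4, 3, 106, 3, 1380)

# Number of rows a result would have if every combination were included
n_combos <- prod(n_levels)
n_combos
```

Any approach that includes all permutations therefore produces a result of that many rows per aggregation, regardless of how many combinations actually occur in the data.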

2 Answers
  • 2021-01-18 23:14

    Build a cross join (CJ) of the unique values of each grouping column, then join your collapsed results onto it:

    dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
    setkey(datCollapsed, g1, g2, g3)
    nrow(datCollapsed[dat.keys])  # keeps every row of dat.keys; NA where datCollapsed has no match
    # [1] 625
    

    Note that the values for combinations that do not occur in the data are NA at this point, but you can easily change them to 0 if you want.
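
    As a minimal sketch of that last step, the example below rebuilds the join on a small made-up table and then replaces the NA counts with 0 by reference. The toy data and the names full and nv are assumptions for illustration, not part of the original answer.

    ```r
    library(data.table)

    # Hypothetical toy data: 3 observed (g1, g2) combinations out of 4 possible
    dat <- data.table(g1 = c("a", "a", "b"), g2 = c(1L, 2L, 1L))
    datCollapsed <- dat[, list(nv = .N), by = list(g1, g2)]

    # All combinations of the observed levels
    dat.keys <- dat[, CJ(g1 = unique(g1), g2 = unique(g2))]
    setkey(datCollapsed, g1, g2)

    # One row per combination; nv is NA for the unobserved pair ("b", 2)
    full <- datCollapsed[dat.keys]

    # Replace the NA counts with 0, modifying full in place
    full[is.na(nv), nv := 0L]
    ```

    Using := with a filter in i updates only the unmatched rows and avoids copying the table.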

  • 2021-01-18 23:31

    I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:

    keycols <- c("g1", "g2", "g3")                       ## Grouping columns
    setkeyv(dat, keycols)                                ## Set dat's key
    ii <- do.call(CJ, lapply(dat[, ..keycols], unique))  ## CJ() to form index; lapply() always returns a list
    datCollapsed <- dat[ii, list(nv=.N), by=.EACHI]      ## Aggregate per row of ii (by=.EACHI needed since data.table 1.9.4)
    
    ## Check that it worked
    nrow(datCollapsed)
    # [1] 625
    table(datCollapsed$nv)
    #   0   1   2   3   4   5   6 
    # 135 191 162  82  39  13   3 
    

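    This approach is referred to as "by-without-by" (grouping by i) and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument. Since data.table 1.9.4, grouping by i has to be requested explicitly with by=.EACHI. The documentation describes the idea as follows: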

    Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in i. When i is a data.table, DT[i,j] evaluates j for each row of i. We call this by without by or grouping by i. Hence, the self join DT[data.table(unique(colA)),j] is identical to DT[,j,by=colA].
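
    The equivalence described in that passage can be checked on a small example. This is a sketch on made-up data (DT, colA, v and the result names are illustrative), written with by=.EACHI, which data.table 1.9.4 and later require for grouping by i.

    ```r
    library(data.table)

    # Hypothetical toy table keyed on the grouping column
    DT <- data.table(colA = c("x", "y", "x", "z"), v = 1:4)
    setkey(DT, colA)

    # Ordinary grouping via the by argument
    byBy <- DT[, list(total = sum(v)), by = colA]

    # Grouping by i: j is evaluated once for each row of the table passed in i
    byI <- DT[data.table(colA = unique(DT$colA)), list(total = sum(v)), by = .EACHI]

    # Compare the group labels and the aggregated values of the two results
    identical(byBy$colA, byI$colA)
    identical(byBy$total, byI$total)
    ```

    The practical difference is that i may also contain combinations that are absent from DT, which is what makes the cross-join approach above return a row for every permutation.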
