Using dynamic column names in `data.table`

前端 未结 1 1229
孤独总比滥情好
孤独总比滥情好 2020-11-28 05:22

I want to calculate mean of each of several columns in a data.table, grouped by another column. My question is similar to two other questions on SO (one and two) but I could

相关标签:
1条回答
  • 2020-11-28 05:39

    you should use .SDcols (especially if you've too many columns and you require a particular operation to be performed only on a subset of the columns (apart from the grouping variable columns).

    dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]
    
    #    condition  var1   var2    var3
    # 1:       one 101.0 1001.0 10001.0
    # 2:       two 104.0 1004.0 10004.0
    # 3:     three 107.0 1007.0 10007.0
    # 4:      four 109.5 1009.5 10009.5
    

    You could also get all the column names you'd want to take mean of first in a variable and then pass it to .SDcols like this:

    keys <- setdiff(names(dtb), "condition")
    # keys = var1, var2, var3
    dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
    

    Edit: As Matthew Dowle rightly pointed out, since you require mean to be computed on every other column after grouping by condition, you could just do:

    dtb[, lapply(.SD, mean), by=condition]
    

    David's edit: (which got rejected): Read more about .SD from this post. I find this is relevant here. Thanks @David.

    Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one column for grouping and 300 numeric columns):

    require(data.table)
    set.seed(45)
    dt <- data.table(grp = sample(letters[1:15], 1000, replace=T))
    m  <- matrix(rnorm(300*1000), ncol=300)
    dt <- cbind(dt, m)
    setkey(dt, "grp")
    

    and you wanted to find the mean of the columns, say, 251:300 alone,

    • you can compute the mean of all the columns and then subset these columns (which is not very efficient as you'll compute on the whole data).

      dt.out <- dt[, lapply(.SD, mean), by=grp]
      dim(dt.out) # 15 * 301, not efficient.
      
    • you can filter the data.table first to just these columns and then compute the mean (which is again not necessarily the best solution as you have to create an extra subset'd data.table every time you want operations on certain columns.

      dt.sub <- dt[, c(1, 251:300)]
      setkey(dt.sub, "grp")
      dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
      
    • you can specify each of the columns one by one as you'd normally do (but this is desirable for smaller data.tables)

      # if you just need one or few columns
      dt.out <- dt[, list(m.v251 = mean(V251)), by = grp]
      

    So what's the best solution? The answer is .SDcols.

    As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD.

    This basically implicitly filters the columns that will be passed to .SD instead of creating a subset (as we did before), only it is VERY efficient and FAST!

    How can we do this?

    • By specifiying either the column numbers:

      dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
      dim(dt.out) # 15 * 51 (what we expect)
      
    • Or alternatively by specifying the column id:

      ids <- paste0("V", 251:300) # get column ids
      dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
      dim(dt.out) # 15 * 51 (what we expect)
      

    It accepts both column names and numbers as arguments. In both these cases, .SD will be provided only with these columns we've specified.

    Hope this helps.

    0 讨论(0)
提交回复
热议问题