How to apply the same function to every specified column in a data.table

北海茫月 2020-11-22 08:00

I have a data.table with which I'd like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular ex

7 answers
  • 2020-11-22 08:24

    Here is an example that creates new columns from a character vector of column names, based on Jfly's answer:

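    library(data.table)

    dt <- data.table(a = rnorm(100), b = rnorm(100), c = rnorm(100), g = rep(1:10, 10))

    col0 <- c("a", "b", "c")        # source columns
    col1 <- paste0("max.", col0)    # names of the new columns

    # For each source column, store its per-group maximum (grouped by g) in a new column
    for (i in seq_along(col0)) {
      dt[, (col1[i]) := max(get(col0[i])), by = g]
    }

    # One row per group: the group size N alongside the group maxima
    dt[, .N, by = c("g", col1)]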
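
As an alternative to the loop, the same per-group maxima can probably be computed in a single call with lapply(.SD, ...) and .SDcols. This is a minimal sketch on made-up toy data (the values of g, a and b are assumptions for illustration, not taken from the answer above):

```r
library(data.table)

# Hypothetical toy data: two groups, two value columns
dt <- data.table(g = c(1, 1, 2, 2), a = c(1, 5, 2, 3), b = c(4, 0, 7, 9))

col0 <- c("a", "b")            # source columns
col1 <- paste0("max.", col0)   # new column names: max.a, max.b

# .SD contains only the columns listed in .SDcols; each group's maximum
# is recycled to every row of that group
dt[, (col1) := lapply(.SD, max), by = g, .SDcols = col0]

print(dt)
```

The source columns a and b are left as they were; only max.a and max.b are added.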
    
  • 2020-11-22 08:24
    library(data.table)
    (dt <- data.table(a = 1:3, b = 1:3, d = 1:3))
    
    This prints:
    
       a b d
    1: 1 1 1
    2: 2 2 2
    3: 3 3 3
    
    Whereas (dt*(-1)) yields:
    
        a  b  d
    1: -1 -1 -1
    2: -2 -2 -2
    3: -3 -3 -3
    
  • 2020-11-22 08:31

    dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)

    dt %>% mutate(across(all_of(cols), ~ -1 * .))

    I benchmarked it using orhan's code (with more rows and columns added). In that benchmark, dplyr::mutate with across generally runs faster than most of the other solutions, but slower than the data.table solution using lapply.

    library(data.table); library(dplyr)
    dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>% 
      mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
    cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")
    
    dt %>% mutate(across(all_of(cols), ~ -1 * .))
    #>               a       b      d      a2      a3      a4      a5      a6
    #>      1:      -1      -1      1      -1      -1      -1      -1      -1
    #>      2:      -2      -2      2      -2      -2      -2      -2      -2
    #>      3:      -3      -3      3      -3      -3      -3      -3      -3
    #>      4:      -4      -4      4      -4      -4      -4      -4      -4
    #>      5:      -5      -5      5      -5      -5      -5      -5      -5
    #>     ---                                                               
    #>  99996:  -99996  -99996  99996  -99996  -99996  -99996  -99996  -99996
    #>  99997:  -99997  -99997  99997  -99997  -99997  -99997  -99997  -99997
    #>  99998:  -99998  -99998  99998  -99998  -99998  -99998  -99998  -99998
    #>  99999:  -99999  -99999  99999  -99999  -99999  -99999  -99999  -99999
    #> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000
    
    library(microbenchmark)
    mbm = microbenchmark(
      base_with_forloop = for (col in seq_along(cols)) {
        dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
      },
      franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
      franks_soln2_w_forloop =  for (j in cols) set(dt, j = j, value = -dt[[j]]),
      orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[,  ..j]],
      orhans_soln2 = dt[,(cols):= - dt[,..cols]],
      dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
      times=1000
    )
    
    library(ggplot2)
    ggplot(mbm) +
      geom_violin(aes(x = expr, y = time)) +
      coord_flip()
    

    Created on 2020-10-16 by the reprex package (v0.3.0)

  • 2020-11-22 08:40

    I would like to add an answer for the case where you also want to change the names of the columns. This comes in handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.

    cols <- c("a", "b")
    out_cols <- paste("log", cols, sep = ".")
    dt[, (out_cols) := lapply(.SD, function(x) log(x, base = exp(1))), .SDcols = cols]
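
To see the effect, here is a small self-contained sketch. The toy values are an assumption chosen so the natural logs come out as round numbers, and it uses plain log, whose default base is already e:

```r
library(data.table)

# Hypothetical toy data; d is a column that should be left alone
dt <- data.table(a = c(1, exp(1)), b = c(exp(2), 1), d = 1:2)

cols <- c("a", "b")
out_cols <- paste("log", cols, sep = ".")   # "log.a", "log.b"

# New columns are appended; a and b keep their original values
dt[, (out_cols) := lapply(.SD, log), .SDcols = cols]

print(names(dt))
```

Because the left-hand side names differ from cols, nothing is overwritten; the result simply has two extra columns.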
    
  • 2020-11-22 08:44

    None of the above solutions seemed to work for me when calculating by group. The following is the best I came up with:

    for(col in cols)
    {
        DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
    }
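
For what it's worth, the lapply(.SD, ...) form can also be combined with by. This is a minimal sketch, assuming a grouping column g and toy data that are not from the original post. scale() returns a one-column matrix, so the sketch flattens it with as.numeric():

```r
library(data.table)

# Hypothetical toy data with two groups; value columns are double
DT <- data.table(g = c("x", "x", "x", "y", "y", "y"),
                 a = c(1, 2, 3, 10, 20, 30),
                 b = c(2, 4, 6, 1, 1, 4))
cols <- c("a", "b")

# Standardise each column within each group of g.
# The columns are already double, so the scaled values fit the existing type.
DT[, (cols) := lapply(.SD, function(x) as.numeric(scale(x))),
   by = g, .SDcols = cols]

print(DT)
```

After this call each column should have mean 0 and standard deviation 1 within every group.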
    
  • 2020-11-22 08:46

    This seems to work:

    dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
    

    The result is

        a  b d
    1: -1 -1 1
    2: -2 -2 2
    3: -3 -3 3
    

    There are a few tricks here:

    • Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
    • .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
    • lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
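
The first trick is easy to check side by side. This is a small sketch on the same three-column table; copy() and the names dt1 and dt2 are only there so the two variants do not interfere with each other:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# Without parentheses the left-hand side is taken literally:
# this adds a single new column that is itself named "cols"
dt1 <- copy(dt)[, cols := 0L]

# With parentheses the character vector is evaluated,
# so columns a and b are the ones that get overwritten
dt2 <- copy(dt)[, (cols) := lapply(.SD, "*", -1), .SDcols = cols]

print(names(dt1))
print(dt2)
```

In dt2 the column d is untouched because it is not listed in .SDcols.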

    EDIT: Here's another way that is probably faster, as @Arun mentioned:

    for (j in cols) set(dt, j = j, value = -dt[[j]])
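
A runnable version of the set() loop, as a sketch on the same toy table. set() modifies dt by reference, so no assignment back to dt is needed:

```r
library(data.table)

dt <- data.table(a = 1:3, b = 1:3, d = 1:3)
cols <- c("a", "b")

# set() updates one column per call, in place, without the overhead of [.data.table
for (j in cols) set(dt, j = j, value = -dt[[j]])

print(dt)
```

Only the columns named in cols change sign; d keeps its original values.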
    