How to apply same function to every specified column in a data.table

后端未结

关注

 7  2245

I have a data.table with which I\'d like to perform the same operation on certain columns. The names of these columns are given in a character vector. In this particular ex

相关标签:

7条回答

独厮守ぢ

2020-11-22 08:24

To add example to create new columns based on a string vector of columns. Based on Jfly answer:

dt <- data.table(a = rnorm(1:100), b = rnorm(1:100), c = rnorm(1:100), g = c(rep(1:10, 10)))

col0 <- c("a", "b", "c")
col1 <- paste0("max.", col0)  

for(i in seq_along(col0)) {
  dt[, (col1[i]) := max(get(col0[i])), g]
}

dt[,.N, c("g", col1)]

0 讨论(0)

孤独总比滥情好

2020-11-22 08:24

library(data.table)
(dt <- data.table(a = 1:3, b = 1:3, d = 1:3))

Hence:

   a b d
1: 1 1 1
2: 2 2 2
3: 3 3 3

Whereas (dt*(-1)) yields:

    a  b  d
1: -1 -1 -1
2: -2 -2 -2
3: -3 -3 -3

0 讨论(0)

暖寄归人

2020-11-22 08:31

dplyr functions work on data.tables, so here's a dplyr solution that also "avoids the for-loop" :)

dt %>% mutate(across(all_of(cols), ~ -1 * .))

I benchmarked it using orhan's code (adding rows and columns) and you'll see dplyr::mutate with across mostly executes faster than most of the other solutions and slower than the data.table solution using lapply.

library(data.table); library(dplyr)
dt <- data.table(a = 1:100000, b = 1:100000, d = 1:100000) %>% 
  mutate(a2 = a, a3 = a, a4 = a, a5 = a, a6 = a)
cols <- c("a", "b", "a2", "a3", "a4", "a5", "a6")

dt %>% mutate(across(all_of(cols), ~ -1 * .))
#>               a       b      d      a2      a3      a4      a5      a6
#>      1:      -1      -1      1      -1      -1      -1      -1      -1
#>      2:      -2      -2      2      -2      -2      -2      -2      -2
#>      3:      -3      -3      3      -3      -3      -3      -3      -3
#>      4:      -4      -4      4      -4      -4      -4      -4      -4
#>      5:      -5      -5      5      -5      -5      -5      -5      -5
#>     ---                                                               
#>  99996:  -99996  -99996  99996  -99996  -99996  -99996  -99996  -99996
#>  99997:  -99997  -99997  99997  -99997  -99997  -99997  -99997  -99997
#>  99998:  -99998  -99998  99998  -99998  -99998  -99998  -99998  -99998
#>  99999:  -99999  -99999  99999  -99999  -99999  -99999  -99999  -99999
#> 100000: -100000 -100000 100000 -100000 -100000 -100000 -100000 -100000

library(microbenchmark)
mbm = microbenchmark(
  base_with_forloop = for (col in 1:length(cols)) {
    dt[ , eval(parse(text = paste0(cols[col], ":=-1*", cols[col])))]
  },
  franks_soln1_w_lapply = dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols],
  franks_soln2_w_forloop =  for (j in cols) set(dt, j = j, value = -dt[[j]]),
  orhans_soln_w_forloop = for (j in cols) dt[,(j):= -1 * dt[,  ..j]],
  orhans_soln2 = dt[,(cols):= - dt[,..cols]],
  dplyr_soln = (dt %>% mutate(across(all_of(cols), ~ -1 * .))),
  times=1000
)

library(ggplot2)
ggplot(mbm) +
  geom_violin(aes(x = expr, y = time)) +
  coord_flip()

^{Created on 2020-10-16 by the reprex package (v0.3.0)}

0 讨论(0)

[愿得一人]

2020-11-22 08:40
I would like to add an answer, when you would like to change the name of the columns as well. This comes in quite handy if you want to calculate the logarithm of multiple columns, which is often the case in empirical work.
```
cols <- c("a", "b")
out_cols = paste("log", cols, sep = ".")
dt[, c(out_cols) := lapply(.SD, function(x){log(x = x, base = exp(1))}), .SDcols = cols]
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
时光取名叫无心

2020-11-22 08:44
None of above solutions seems to work with calculation by group. Following is the best I got:
```
for(col in cols)
{
    DT[, (col) := scale(.SD[[col]], center = TRUE, scale = TRUE), g]
}
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
梦毁少年i

2020-11-22 08:46
This seems to work:
```
dt[ , (cols) := lapply(.SD, "*", -1), .SDcols = cols]
```
The result is
```
    a  b d
1: -1 -1 1
2: -2 -2 2
3: -3 -3 3
```
There are a few tricks here:
- Because there are parentheses in (cols) :=, the result is assigned to the columns specified in cols, instead of to some new variable named "cols".
- .SDcols tells the call that we're only looking at those columns, and allows us to use .SD, the Subset of the Data associated with those columns.
- lapply(.SD, ...) operates on .SD, which is a list of columns (like all data.frames and data.tables). lapply returns a list, so in the end j looks like cols := list(...).
EDIT: Here's another way that is probably faster, as @Arun mentioned:
```
for (j in cols) set(dt, j = j, value = -dt[[j]])
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 下一页