Question:
There are other posts about row-wise operators on data.table. They are either too simple or solve a specific scenario.
My question here is more generic. There is a solution using dplyr. I have played around but failed to find an equivalent solution using data.table syntax. Can you please suggest an elegant data.table solution that reproduces the same results as the dplyr version?
EDIT 1: Summary of benchmarks of the suggested solutions on a real dataset (10 MB, 73000 rows, stats computed on 24 numeric columns). The benchmark results are subjective. However, the elapsed times are consistently reproducible.
| Solution By  | Speed compared to dplyr     |
|--------------|-----------------------------|
| Metrics v1   | 4.3 times SLOWER (use .SD)  |
| Metrics v2   | 5.6 times FASTER            |
| ExperimenteR | 15 times FASTER             |
| Arun v1      | 3 times FASTER (Map func)   |
| Arun v2      | 3 times FASTER (foo func)   |
| Ista         | 4.5 times FASTER            |
EDIT 2: I added the NA count column (NAcnt) a day later. This is why this column does not appear in the solutions suggested by the various contributors.
Data Setup
library(data.table) dt
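The table definition is cut off above. As a rough sketch, the five rows can be rebuilt from the printed dplyr output in the question; the exact original call and the column types (numeric here) are assumptions.

```r
library(data.table)

# Reconstructed from the printed output in the question; the original
# definition (and whether the columns were integer or double) may differ.
dt <- data.table(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
  Country     = c("CA", "FR", "FR", "CA", "CA"),
  Q1 = c(NA, 61, 40, 54, NA),
  Q2 = c(22,  8, NA,  5, NA),
  Q3 = c(51, NA, NA, 16, NA),
  Q4 = c(79, 10, 49, NA, NA)
)
```

Note that the answers print 79 for Spinach's Q3, so they appear to have been run against a slightly different version of the data.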
SOLUTION using dplyr + rowwise()
    library(dplyr) ; library(magrittr)
    dt %>% rowwise() %>%
      transmute(ProductName, Country, Q1, Q2, Q3, Q4,
                AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))
    #   ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM NAcnt
    # 1     Lettuce      CA NA 22 51 79 50.66667  22   79 152     1
    # 2    Beetroot      FR 61  8 NA 10 26.33333   8   61  79     1
    # 3     Spinach      FR 40 NA NA 49 44.50000  40   49  89     2
    # 4        Kale      CA 54  5 16 NA 25.00000   5   54  75     1
    # 5      Carrot      CA NA NA NA NA      NaN Inf -Inf   0     4
ERROR with data.table (computes over the entire column instead of per row)
    dt[, .(ProductName, Country, Q1, Q2, Q3, Q4,
           AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
           MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
           MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
           SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
           NAcnt= sum(is.na(c(Q1, Q2, Q3, Q4))))]
    #    ProductName Country Q1 Q2 Q3 Q4      AVG MIN MAX SUM NAcnt
    # 1:     Lettuce      CA NA 22 51 79 35.90909   5  79 395     9
    # 2:    Beetroot      FR 61  8 NA 10 35.90909   5  79 395     9
    # 3:     Spinach      FR 40 NA NA 49 35.90909   5  79 395     9
    # 4:        Kale      CA 54  5 16 NA 35.90909   5  79 395     9
    # 5:      Carrot      CA NA NA NA NA 35.90909   5  79 395     9
ALMOST solution but more complex and missing Q1,Q2,Q3,Q4 output columns
dtmelt
Answer 1:
You can use the efficient row-wise functions from the matrixStats package.
    library(matrixStats)
    # .SDcols takes the column names as strings
    dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
              MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
              AVG = rowMeans(.SD, na.rm=TRUE),
              SUM = rowSums(.SD, na.rm=TRUE)),
       .SDcols=c("Q1", "Q2", "Q3", "Q4")]
    dt
    #    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX      AVG SUM
    # 1:     Lettuce      CA NA 22 51 79  22   79 50.66667 152
    # 2:    Beetroot      FR 61  8 NA 10   8   61 26.33333  79
    # 3:     Spinach      FR 40 NA 79 49  40   79 56.00000 168
    # 4:        Kale      CA 54  5 16 NA   5   54 25.00000  75
    # 5:      Carrot      CA NA NA NA NA Inf -Inf      NaN   0
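The NA count column that the question added later (EDIT 2) can be produced in the same vectorized style. This is a minimal sketch using only base R's `rowSums()` and `is.na()` on `.SD`; it is not taken from any of the answers, and the small table and the `qcols` name are made up for the illustration.

```r
library(data.table)

# Small illustrative table (not the question's data)
dt <- data.table(Q1 = c(NA, 61, NA),
                 Q2 = c(22,  8, NA),
                 Q3 = c(51, NA, NA),
                 Q4 = c(79, 10, NA))
qcols <- c("Q1", "Q2", "Q3", "Q4")  # hypothetical helper variable

# is.na(.SD) yields a logical matrix; rowSums() counts the TRUEs per row
dt[, NAcnt := rowSums(is.na(.SD)), .SDcols = qcols]
dt
```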
For a dataset with 500000 rows (using the data.table from CRAN)
dt
rowwise (or by=1:nrow(dt)) is a "euphemism" for a for loop, as exemplified by
    library(dplyr) ; library(magrittr)
    system.time(dt %>% rowwise() %>%
                  transmute(ProductName, Country, Q1, Q2, Q3, Q4,
                            MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                            MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                            AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
                            SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
    #    user  system elapsed
    #  80.832   0.111  80.974

    system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
                          MIN = min(.SD, na.rm=TRUE),
                          MAX = max(.SD, na.rm=TRUE),
                          SUM = sum(.SD, na.rm=TRUE)),
                   .SDcols=c("Q1", "Q2", "Q3", "Q4"), by=1:nrow(dt)])
    #    user  system elapsed
    # 141.492   0.196 141.757
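Since grouping by row is the slow part, one way to avoid both `by=1:nrow(dt)` and an extra package is to combine base R's `pmin()`/`pmax()` with `rowMeans()`/`rowSums()`. The sketch below is a swapped-in technique, not one of the benchmarked answers, and no timing is claimed for it. The data is rebuilt from the question's printed output, and `qcols` is a hypothetical helper name. One behavioral difference: for an all-NA row, `pmin()`/`pmax()` with `na.rm = TRUE` return NA rather than Inf/-Inf.

```r
library(data.table)

# Data rebuilt from the question's printed output (an assumption)
dt <- data.table(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
  Q1 = c(NA, 61, 40, 54, NA),
  Q2 = c(22,  8, NA,  5, NA),
  Q3 = c(51, NA, NA, 16, NA),
  Q4 = c(79, 10, 49, NA, NA)
)
qcols <- c("Q1", "Q2", "Q3", "Q4")  # hypothetical helper variable

# pmin/pmax take the columns as separate arguments, hence do.call();
# every expression works on whole columns, so no per-row grouping is needed
dt[, `:=`(
  MIN = do.call(pmin, c(.SD, list(na.rm = TRUE))),
  MAX = do.call(pmax, c(.SD, list(na.rm = TRUE))),
  AVG = rowMeans(.SD, na.rm = TRUE),
  SUM = rowSums(.SD, na.rm = TRUE)
), .SDcols = qcols]
dt
```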
Answer 2:
Using by=1:nrow(dt) performs the row-wise operation in data.table:
    library(data.table)
    # .SDcols takes the column names as strings
    dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
              MIN = min(.SD, na.rm=TRUE),
              MAX = max(.SD, na.rm=TRUE),
              SUM = sum(.SD, na.rm=TRUE)),
       .SDcols=c("Q1", "Q2", "Q3", "Q4"), by=1:nrow(dt)]
    #    ProductName Country Q1 Q2 Q3 Q4      AVG MIN  MAX SUM
    # 1:     Lettuce      CA NA 22 51 79 50.66667  22   79 152
    # 2:    Beetroot      FR 61  8 NA 10 26.33333   8   61  79
    # 3:     Spinach      FR 40 NA 79 49 56.00000  40   79 168
    # 4:        Kale      CA 54  5 16 NA 25.00000   5   54  75
    # 5:      Carrot      CA NA NA NA NA      NaN Inf -Inf   0
    # Warning messages:
    # 1: In min(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
    #   no non-missing arguments to min; returning Inf
    # 2: In max(c(NA_real_, NA_real_, NA_real_, NA_real_), na.rm = TRUE) :
    #   no non-missing arguments to max; returning -Inf
You get the warning messages because in row 5 you are computing the min and max of nothing: all four values are NA, and na.rm=TRUE removes them. For example, see below:
    min(c(NA,NA,NA,NA),na.rm=TRUE)
    # [1] Inf
    # Warning message:
    # In min(c(NA, NA, NA, NA), na.rm = TRUE) :
    #   no non-missing arguments to min; returning Inf
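If Inf/-Inf and the warnings are unwanted for all-NA rows, one option is to wrap `min()` and `max()` in small guards. `safe_min` and `safe_max` below are hypothetical helper names, not functions from base R or data.table, and returning NA for an empty row is an assumption about the desired behavior.

```r
# Hypothetical helpers: return NA instead of Inf / -Inf when every value
# is missing, which also avoids the "no non-missing arguments" warning.
safe_min <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

safe_min(c(NA, NA, NA, NA))  # all missing: NA, no warning
safe_min(c(54, 5, 16, NA))   # ordinary row: the smallest non-missing value
safe_max(c(NA, 5, 16, NA))   # ordinary row: the largest non-missing value
```

These can be dropped into the `by=1:nrow(dt)` call in place of `min` and `max`.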
Answer 3:
Just another way (not that efficient though, as na.omit() is called each time, and there are many memory allocations as well):
    require(data.table)
    new_cols = c("MIN", "MAX", "SUM", "AVG")
    dt[, (new_cols) := Map(function(x, f) f(x),
                           list(na.omit(c(Q1,Q2,Q3,Q4))),
                           list(min, max, sum, mean)),
       by = 1:nrow(dt)]
    #    ProductName Country Q1 Q2 Q3 Q4 MIN  MAX SUM      AVG
    # 1:     Lettuce      CA NA 22 51 79  22   79 152 50.66667
    # 2:    Beetroot      FR 61  8 NA 10   8   61  79 26.33333
    # 3:     Spinach      FR 40 NA 79 49  40   79 168 56.00000
    # 4:        Kale      CA 54  5 16 NA   5   54  75 25.00000
    # 5:      Carrot      CA NA NA NA NA Inf -Inf   0      NaN
But as I mentioned, this'll get much simpler once colwise() and rowwise() are implemented. The syntax in this case could look something like:
dt[, rowwise(.SD, list(MIN=min, MAX=max, SUM=sum, AVG=mean), na.rm=TRUE), by = 1:nrow(dt)] # `by = ` is really not necessary in this case.
or even more straightforward for this case:
rowwise(dt, list(...), na.rm=TRUE)
Edit:
Another variation:
myNACount
Answer 4:
The apply function can be used to perform row-wise calculations. Defining the function separately keeps things cleaner:
dstats
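A minimal sketch of what such a function could look like, assuming it returns the four statistics in the order AVG, MIN, MAX, SUM used in the assignment below. Only the name `dstats` comes from the answer; the body and the small matrix `m` are assumptions made for illustration.

```r
# Hypothetical body for dstats: takes one row as a numeric vector and
# returns a named vector of summary statistics, ignoring NAs.
dstats <- function(x) {
  c(AVG = mean(x, na.rm = TRUE),
    MIN = min(x, na.rm = TRUE),
    MAX = max(x, na.rm = TRUE),
    SUM = sum(x, na.rm = TRUE))
}

# Quick check on a plain matrix: apply() over rows returns one column per
# input row, so the result is transposed to get one row per input row.
m <- rbind(c(NA, 22, 51, 79),
           c(54,  5, 16, NA))
t(apply(m, 1, dstats))
```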
The function can now be applied over the rows of the data.table.
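    # with = FALSE is not needed when the left-hand side of := is a
    # character vector, and its use with := is deprecated
    (dt[, c("AVG", "MIN", "MAX", "SUM") := data.frame(t(apply(.SD, 1, dstats))),
        .SDcols=c("Q1", "Q2", "Q3", "Q4")])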
Notice that the only advantage of doing this with [.data.table is that it allows the use of := for fast adding by reference.
This is slower but more flexible than the matrixStats solution, and faster than the dplyr solution by @ExperimenteR, clocking in at 36 seconds (my timings for the other methods were similar to those in @ExperimenteR's answer).