The title is pretty straightforward: how can I calculate the difference between the largest and smallest value across columns, for each row?
Let's assume this is my data
Here's an attempt using my old favourite max.col with a bit of matrix indexing:
rw <- seq_len(nrow(dat))
dat[cbind(rw, max.col(dat))] - dat[cbind(rw, max.col(-dat))]
#[1] 3 9 3 3
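To see why the one-liner works, it may help to break it into steps. The sketch below assumes dat holds the same four rows as the example data frame defined later in this thread. It sets ties.method = "first" only to make the column positions reproducible; the default random tie-breaking would give the same differences. The names m, max_pos, min_pos, row_max and row_min are made up for illustration.

```r
# Example data, assumed to match the four-row data frame used in this thread
dat <- data.frame(a = c(1L, 0L, 3L, 9L), b = c(2L, 3L, 2L, 8L),
                  c = c(3L, 6L, 1L, 7L), d = c(4L, 9L, 4L, 6L))
m  <- as.matrix(dat)            # work on a plain matrix for clarity
rw <- seq_len(nrow(m))

# Column position of each row's maximum, and (by negating) its minimum
max_pos <- max.col(m,  ties.method = "first")   # 4 4 4 1
min_pos <- max.col(-m, ties.method = "first")   # 1 1 3 4

# Indexing with a two-column (row, column) matrix picks one element per row
row_max <- m[cbind(rw, max_pos)]   # 4 9 4 9
row_min <- m[cbind(rw, min_pos)]   # 1 0 1 6
row_max - row_min                  # 3 9 3 3
```

Because every step is a vectorised operation over the whole matrix, no R-level function is called once per row.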
This should be much faster on large datasets, as per:
# 5 million big enough?
dat <- dat[sample(1:4,5e6,replace=TRUE),]
system.time({
  rw <- seq_len(nrow(dat))
  dat[cbind(rw, max.col(dat))] - dat[cbind(rw, max.col(-dat))]
})
#   user  system elapsed
#   2.43    0.20    2.63
system.time({
  apply(X = dat, MARGIN = 1, function(x) diff(range(x)))
})
#   user  system elapsed
#  94.91    0.17   95.16
1
For each row (using apply with MARGIN = 1), use range to obtain a vector of the minimum and maximum values, then diff to obtain the difference between them:
apply(X = df, MARGIN = 1, function(x) diff(range(x)))
#[1] 3 9 3 3
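As a rough illustration of what the anonymous function does for a single row, here is the second row worked through by hand. The data frame is assumed to be the same four-row example used in this thread, and x is just an illustrative name.

```r
# Example data, assumed to match the data frame used in this thread
df <- data.frame(a = c(1L, 0L, 3L, 9L), b = c(2L, 3L, 2L, 8L),
                 c = c(3L, 6L, 1L, 7L), d = c(4L, 9L, 4L, 6L))

x <- unlist(df[2, ])   # second row as a vector: 0 3 6 9
range(x)               # c(min, max): 0 9
diff(range(x))         # max - min: 9
```

apply repeats this for every row, calling the R function once per row, which is why it gets slow when there are millions of rows.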
2
If you want a speedier solution, you can use the parallel maxima and minima functions (pmax and pmin):
do.call(pmax, df) - do.call(pmin, df)
#[1] 3 9 3 3
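The do.call step works because a data frame is a list of columns, so each column is passed to pmax (or pmin) as a separate argument. A minimal sketch of that equivalence, assuming the same four-row example data; via_do_call and spelled_out are illustrative names:

```r
# Example data, assumed to match the data frame used in this thread
df <- data.frame(a = c(1L, 0L, 3L, 9L), b = c(2L, 3L, 2L, 8L),
                 c = c(3L, 6L, 1L, 7L), d = c(4L, 9L, 4L, 6L))

# do.call() hands every column to pmax/pmin as its own argument ...
via_do_call <- do.call(pmax, df) - do.call(pmin, df)

# ... which is the same as writing the columns out by hand
spelled_out <- pmax(df$a, df$b, df$c, df$d) - pmin(df$a, df$b, df$c, df$d)

via_do_call                       # 3 9 3 3
all(via_do_call == spelled_out)   # TRUE
```

The do.call form has the advantage that it does not need to know the column names or how many columns there are.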
df = structure(list(a = c(1L, 0L, 3L, 9L), b = c(2L, 3L, 2L, 8L),
c = c(3L, 6L, 1L, 7L), d = c(4L, 9L, 4L, 6L)), .Names = c("a",
"b", "c", "d"), class = "data.frame", row.names = c(NA, -4L))
Timings
dat <- df[sample(1:4,5e6,replace=TRUE),]
rw <- seq_len(nrow(dat))
system.time({
  apply(X = dat, MARGIN = 1, function(x) diff(range(x)))
})
#STILL RUNNING...
system.time({
  rw <- seq_len(nrow(dat))
  dat[cbind(rw, max.col(dat))] - dat[cbind(rw, max.col(-dat))]
})
#   user  system elapsed
#   3.48    0.11    3.59
system.time(do.call(pmax, dat) - do.call(pmin, dat))
#   user  system elapsed
#   0.23    0.00    0.26
identical(do.call(pmax, dat) - do.call(pmin, dat),
          dat[cbind(rw, max.col(dat))] - dat[cbind(rw, max.col(-dat))])
#[1] TRUE