I\'m trying to figure out the best way how to use multiple variable filters in R.
Usually have up to 100 variables (in one condition) and need to filter cases where ANY
We could use Reduce
with lapply
vys1 <- dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]
identical(as.vector(vys), vys1)
#[1] TRUE
Based on the same sort of benchmarks used
pm<-proc.time()
vys<-((x1==37) | (x2==37) | (x3==37) | (x4==37) | (x5==37) | (x6==37) | (x7==37) | (x8==37) | (x9==37) | (x10==37))
proc.time() - pm
# user system elapsed
# 0.05 0.13 0.93
pm<-proc.time()
vys1 <- dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]
proc.time() - pm
# user system elapsed
# 0.05 0.03 0.08
You're looking for a function that works on every row of your dataframe. That's what "apply" is doing. It's equally fast as the solution of others, but easy to handle:
system.time(
((x1==37) | (x2==37) | (x3==37) | (x4==37) | (x5==37) | (x6==37) | (x7==37) | (x8==37) | (x9==37) | (x10==37))
)
# user system elapsed
# 0.02 0.00 0.02
system.time(
apply(df, 1 , function(x) any(x[2:11]==37))
)
# user system elapsed
# 0.59 0.00 0.61
Although you don't ask for changing data structure, I recommend have a look at tidy data. With a rearranged version of your dataframe you can do filterings efficient and easy to handle:
library(tidyr)
df2 = gather(df, key, value, -id)
system.time(
select(filter(df, value==37), id)
)
# user system elapsed
# 0.02 0.00 0.02
I think your second statement in base R is OK, just try it with [
instead of subset
:
rowSums(df[sprintf("x%d", 1:10)]==37) > 0
Benchmarks:
library(microbenchmark)
microbenchmark( times = 20,
subset = {((rowSums(subset(df,select=c(x1:x10))==37)>0))},
dt_reduce = {dt[, Reduce('|', lapply(.SD, '==', 37)), .SDcols= x1:x10]},
base_r = {rowSums(df[sprintf("x%d", 1:10)]==37) > 0}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# subset 82.74922 88.63819 99.69935 91.18369 110.24876 134.06550 20
# dt_reduce 25.78002 28.62765 32.73945 28.89021 29.12712 71.25822 20
# base_r 21.52504 24.27624 27.03380 25.83219 26.24400 65.38550 20