My dataframe looks like this:
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
vec <- c("a", "i", "s", "t", "z")
I'd like to get a 1 for every row of df that contains at least one of the values in vec, and a 0 otherwise, without using a loop.
As another idea: preserving and operating on the "list" structure of a "data.frame", rather than converting it to an atomic object (via sapply, as.matrix, do.call(_bind, ...), etc.), could be efficient. In this case we could use something like:
as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
#[1] 1 0 1 0
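To see the intermediate steps (a sketch, using the small example df and vec from the question): lapply produces one logical vector per column, and Reduce("|") ORs them together element-wise, so a row becomes TRUE as soon as any of its columns matched.
cols <- lapply(df, function(x) x %in% vec)  # one logical vector per column
cols$x1
#[1]  TRUE FALSE FALSE FALSE
Reduce("|", cols)  # element-wise OR across the columns
#[1]  TRUE FALSE  TRUE FALSE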
And to compare with the fastest approach so far, Ananda Mahto's (using the larger "df" from the benchmarks below):
AL = function() as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
AM = function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
identical(AM(), AL())
#[1] TRUE
microbenchmark::microbenchmark(AM(), AL(), times = 50)
#Unit: milliseconds
# expr      min       lq   median       uq      max neval
# AM() 49.20072 53.53789 58.03740 66.76898 86.04280    50
# AL() 45.24706 49.34271 51.43577 55.05866 74.79533    50
There does not appear to be any significant efficiency gain but, I guess, it's worth noting that the two loops (in Reduce and lapply) did not prove to be as slow as one might have expected.
Here's one way to do this:
df$valueFound <- apply(df, 1, function(x) {
  if (any(x %in% vec)) {
    1
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
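All of these work because R coerces logical values to numbers in arithmetic contexts, for example:
> TRUE + 0
[1] 1
> FALSE %/% TRUE
[1] 0
> cumprod(TRUE)
[1] 1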
Since you don't want a loop, you could get creative and paste the columns together by row, then use grepl to compare the result with vec:
> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0
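One caveat: this builds the regular expression "a|i|s|t|z", so with multi-character values a short pattern like "a" would also match inside longer strings such as "ab". A stricter variant (a sketch, assuming the values contain no regex metacharacters) wraps the alternatives in word boundaries:
> pat <- paste0("\\b(", paste(vec, collapse = "|"), ")\\b")
> as.numeric(grepl(pat, do.call(paste, df)))
[1] 1 0 1 0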
Here's a second option that compares the rows to the unlisted data frame:
> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0
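To unpack that one-liner (a sketch on the example data): unlist(df) flattens the data frame column by column, and row(df) holds the row index of every cell, so subsetting it by the logical hit vector returns the row numbers that contain a match.
> hits <- unlist(df) %in% vec  # column-wise hit positions in the flattened data
> row(df)[hits]                # row number of each hit
[1] 1 1 3
> as.numeric(seq(nrow(df)) %in% row(df)[hits])
[1] 1 0 1 0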
This would be faster than an apply-based solution (despite its cryptic construction):
as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0
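The cryptic part is the inline `dim<-` call: %in% on a matrix returns a plain logical vector with the dim attribute dropped, and `dim<-` restores the nrow-by-ncol shape so rowSums can count matches per row. The same computation, spelled out as a sketch:
m <- as.matrix(df)    # character matrix; %in% will drop its dim attribute
hits <- m %in% vec    # plain logical vector of length nrow * ncol
dim(hits) <- dim(df)  # restore the nrow x ncol shape
as.numeric(rowSums(hits) >= 1)
[1] 1 0 1 0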
Here, we can make up some bigger data to test on; these benchmarks use 100,000 rows.
set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)
Here are the approaches we have so far:
AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df, 1, function(x) {
    if (any(x %in% vec)) {
      1
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
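Before timing anything, a quick sanity check that all of the approaches agree on the large df (a sketch; NR4 returns an integer vector while the others return doubles, so values are compared with == rather than identical()):
res <- list(AM(), NR1(), NR2(), NR3(), NR4(), NR5(), RS1(), RS2())
all(vapply(res, function(x) all(x == res[[1]]), logical(1)))
# [1] TRUE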
I suspect the NR functions will be a little slower:
system.time(NR1()) # Other NR functions are about the same
# user system elapsed
# 1.172 0.000 1.196
And, similarly, Richard's second approach:
system.time(RS2())
# user system elapsed
# 0.918 0.000 0.932
The grepl approach and the rowSums-based function are left for the benchmarks:
library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#  expr       min       lq      mean    median       uq      max neval
#  AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
# RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100