Search and find values row-wise a dataframe

后端未结

关注

 4  1942

My dataframe looks like this:

x1 <- c(\"a\", \"c\", \"f\", \"j\")
x2 <- c(\"b\", \"c\", \"g\", \"k\")
x3 <- c(\"b\", \"d\", \"h\", NA)
x4 <- c(\"


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  梦毁少年i        
                
              
                            
                2021-01-06 11:14
              
            
            
                                                                       
As another idea, trying to preserve and operate on the "list" structure of a "data.frame" and not converting it to atomic (i.e. sapply, as.matrix, do.call(_bind, ...) etc.) could be efficient. In this case we could use something like:

as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
#[1] 1 0 1 0


And to compare with -the fastest so far- Ananda Mahto's apporach (using the larger "df"):

AL = function() as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
AM = function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
identical(AM(), AL())
#[1] TRUE
microbenchmark::microbenchmark(AM(), AL(), times = 50)
#Unit: milliseconds
# expr      min       lq   median       uq      max neval
# AM() 49.20072 53.53789 58.03740 66.76898 86.04280    50
# AL() 45.24706 49.34271 51.43577 55.05866 74.79533    50


There does not appear any significant efficiency gain, but, I guess, it's worth noting that the 2 loops (in Reduce and lapply) didn't prove to be as slow as -probably- would be expected.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天涯浪人        
                
              
                            
                2021-01-06 11:22
              
            
            
                                                                       
Here's one way to do this:

df$valueFound <- apply(df,1,function(x){
  if(any(x %in% vec)){ 
    1 
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0


Thanks to @David Arenburg and @CathG, a couple of more concise approaches:


apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))


Just for fun, a couple of other interesting variants:


apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2021-01-06 11:22
              
            
            
                                                                       
Since you don't want a loop, you could get creative and paste the  columns together by row, and then use grepl to compare it with vec

> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0 


Here's a second option that compares the rows to the unlisted data frame

> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  闹比i        
                
              
                            
                2021-01-06 11:30
              
            
            
                                                                       
This would be faster than an apply based solution (despite it's cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0




Update -- Some benchmarks

Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)


Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df,1,function(x){
    if(any(x %in% vec)){ 
      1 
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])


I'm suspecting the NR functions will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196 


And, similarly, Richard's second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932 


The grepl and this rowSum function are left for the benchmarks:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复