Search and find values row-wise a dataframe

后端 未结 4 1938
后悔当初
后悔当初 2021-01-06 11:00

My dataframe looks like this:

x1 <- c(\"a\", \"c\", \"f\", \"j\")
x2 <- c(\"b\", \"c\", \"g\", \"k\")
x3 <- c(\"b\", \"d\", \"h\", NA)
x4 <- c(\"         


        
相关标签:
4条回答
  • 2021-01-06 11:14

    As another idea, trying to preserve and operate on the "list" structure of a "data.frame" and not converting it to atomic (i.e. sapply, as.matrix, do.call(_bind, ...) etc.) could be efficient. In this case we could use something like:

    as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
    #[1] 1 0 1 0
    

    And to compare with -the fastest so far- Ananda Mahto's apporach (using the larger "df"):

    AL = function() as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
    AM = function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
    identical(AM(), AL())
    #[1] TRUE
    microbenchmark::microbenchmark(AM(), AL(), times = 50)
    #Unit: milliseconds
    # expr      min       lq   median       uq      max neval
    # AM() 49.20072 53.53789 58.03740 66.76898 86.04280    50
    # AL() 45.24706 49.34271 51.43577 55.05866 74.79533    50
    

    There does not appear any significant efficiency gain, but, I guess, it's worth noting that the 2 loops (in Reduce and lapply) didn't prove to be as slow as -probably- would be expected.

    0 讨论(0)
  • 2021-01-06 11:22

    Here's one way to do this:

    df$valueFound <- apply(df,1,function(x){
      if(any(x %in% vec)){ 
        1 
      } else {
        0
      }
    })
    ##
    > df
      x1 x2   x3   x4 valueFound
    1  a  b    b    a          1
    2  c  c    d    e          0
    3  f  g    h    i          1
    4  j  k <NA> <NA>          0
    

    Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

    • apply(df, 1, function(x) any(x %in% vec) + 0)
    • apply(df, 1, function(x) as.numeric(any(x %in% vec)))

    Just for fun, a couple of other interesting variants:

    • apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
    • apply(df, 1, function(x) cumprod(any(x %in% vec)))
    0 讨论(0)
  • 2021-01-06 11:22

    Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl to compare it with vec

    > as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
    [1] 1 0 1 0 
    

    Here's a second option that compares the rows to the unlisted data frame

    > as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
    [1] 1 0 1 0
    
    0 讨论(0)
  • 2021-01-06 11:30

    This would be faster than an apply based solution (despite it's cryptic construction):

    as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
    [1] 1 0 1 0
    

    Update -- Some benchmarks

    Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.

    set.seed(1)
    nrow <- 100000
    ncol <- 10
    vec <- c("a", "i", "s", "t", "z")
    df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                            nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)
    

    Here are the approaches we have so far:

    AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
    NR1 <- function() {
      apply(df,1,function(x){
        if(any(x %in% vec)){ 
          1 
        } else {
          0
        }
      })
    }
    NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
    NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
    NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
    NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
    RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
    RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
    

    I'm suspecting the NR functions will be a little slower:

    system.time(NR1()) # Other NR functions are about the same
    #    user  system elapsed 
    #   1.172   0.000   1.196 
    

    And, similarly, Richard's second approach:

    system.time(RS2())
    #    user  system elapsed 
    #   0.918   0.000   0.932 
    

    The grepl and this rowSum function are left for the benchmarks:

    library(microbenchmark)
    microbenchmark(AM(), RS1())
    # Unit: milliseconds
    #   expr       min       lq      mean    median       uq      max neval
    #   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
    #  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100
    
    0 讨论(0)
提交回复
热议问题