How to implement coalesce efficiently in R

深忆病人 2020-11-21 23:20

Background

Several SQL dialects (I mostly use PostgreSQL) have a function called coalesce, which returns the first non-null column element for each row. This can b

8 Answers
  • 2020-11-21 23:51

    Using dplyr package:

    library(dplyr)
    coalesce(a, b, c)
    # [1]  1  2 NA  4  6
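The vectors a, b and c are not defined in this excerpt. As a minimal sketch, the values below are an assumption chosen to match the printed result, and first_non_na is a hypothetical base-R helper (not part of dplyr) that spells out the intended semantics position by position:

```r
# Assumed example inputs (not shown in the excerpt); chosen so that the
# coalesced result is 1 2 NA 4 6.
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)

# Hypothetical reference helper: for each position, return the first
# non-NA value across the inputs, or NA if every input is NA there.
first_non_na <- function(...) {
  cols <- cbind(...)                       # one column per input vector
  apply(cols, 1, function(row) {
    hit <- row[!is.na(row)]
    if (length(hit) > 0) hit[[1]] else NA
  })
}

result <- first_non_na(a, b, c)
print(result)
```

This row-wise version is meant only as a readable specification to check faster implementations against; it is not expected to be fast.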
    

    Benchmark: on small vectors, dplyr's coalesce is not as fast as the accepted solution (coalesce2):

    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x},
        list(...))
    }
    
    microbenchmark::microbenchmark(
      coalesce(a, b, c),
      coalesce2(a, b, c)
    )
    
    # Unit: microseconds
    #                expr    min     lq     mean median      uq     max neval cld
    #   coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
    #  coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a 
    

    On a larger dataset, however, the two are comparable:

    aa <- sample(a, 100000, TRUE)
    bb <- sample(b, 100000, TRUE)
    cc <- sample(c, 100000, TRUE)
    
    microbenchmark::microbenchmark(
      coalesce(aa, bb, cc),
      coalesce2(aa, bb, cc))
    
    # Unit: milliseconds
    #                   expr      min       lq     mean   median       uq      max neval cld
    #   coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
    #  coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a
    
  • 2020-11-21 23:56

    From data.table >= 1.12.3 you can use fcoalesce.

    library(data.table)
    fcoalesce(a, b, c)
    # [1]  1  2 NA  4  6
    

    For more info, including a benchmark, see NEWS item #18 for development version 1.12.3.

  • 2020-11-21 23:58

    Looks like coalesce1 is still available

    coalesce1 <- function(...) {
        ans <- ..1
        for (elt in list(...)[-1]) {
            i <- is.na(ans)
            ans[i] <- elt[i]
        }
        ans
    }
    

    which is faster still (but more-or-less a hand re-write of Reduce, so less general)

    > identical(coalesce(a, b, c), coalesce1(a, b, c))
    [1] TRUE
    > microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
    Unit: microseconds
                   expr     min       lq   median       uq     max neval
      coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
     coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
     coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100
    

    Or, for larger data, compare

    coalesce1a <- function(...) {
        ans <- ..1
        for (elt in list(...)[-1]) {
            i <- which(is.na(ans))
            ans[i] <- elt[i]
        }
        ans
    }
    

    showing that which() can sometimes be effective, even though it implies a second pass through the index.

    > aa <- sample(a, 100000, TRUE)
    > bb <- sample(b, 100000, TRUE)
    > cc <- sample(c, 100000, TRUE)
    > microbenchmark(coalesce1(aa, bb, cc),
    +                coalesce1a(aa, bb, cc),
    +                coalesce2(aa,bb,cc), times=10)
    Unit: milliseconds
                       expr       min        lq    median        uq       max neval
      coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
     coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
      coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
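To make the comparison above concrete, here is a minimal sketch (with assumed inputs, since a, b and c are not defined in this excerpt) checking that the logical-index and which() variants produce the same values; only the form of the index differs. Timings depend on the machine and on the fraction of NA values, so none are claimed here.

```r
# Assumed inputs; any equal-length vectors with some NA values would do.
set.seed(1)
x <- sample(c(1, 2, NA, 4, NA), 1000, TRUE)
y <- sample(c(NA, NA, NA, 5, 6), 1000, TRUE)

# Logical index: a full-length TRUE/FALSE vector.
i_lgl <- is.na(x)
by_lgl <- x
by_lgl[i_lgl] <- y[i_lgl]

# Integer index: only the positions that are NA, via which().
i_int <- which(is.na(x))
by_int <- x
by_int[i_int] <- y[i_int]

same <- identical(by_lgl, by_int)
print(same)
```

The integer index is never longer than the logical one, which is one plausible reason the which() variant can win when few elements are NA.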
    
  • 2020-11-22 00:00

    A very simple solution is to use the ifelse function from the base package:

    coalesce3 <- function(x, y) {
    
        ifelse(is.na(x), y, x)
    }
    

    Although it appears to be slower than coalesce2 above:

    test <- function(a, b, func) {
    
        for (i in 1:10000) {
    
            func(a, b)
        }
    }
    
    system.time(test(a, b, coalesce2))
    user  system elapsed 
    0.11    0.00    0.10 
    
    system.time(test(a, b, coalesce3))
    user  system elapsed 
    0.16    0.00    0.15 
    

    You can use Reduce to make it work for an arbitrary number of vectors:

    coalesce4 <- function(...) {
    
        Reduce(coalesce3, list(...))
    }
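As a usage sketch of the Reduce approach, the block below restates the two functions compactly and applies them to three vectors. The input values are an assumption, since a, b and c are not defined in this excerpt.

```r
# Pairwise coalesce from the answer above.
coalesce3 <- function(x, y) ifelse(is.na(x), y, x)

# Fold the pairwise version over any number of vectors, left to right.
coalesce4 <- function(...) Reduce(coalesce3, list(...))

# Assumed example inputs (not shown in the excerpt).
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)

out <- coalesce4(a, b, c)
print(out)
```

Reduce first combines a with b, then combines that intermediate result with c, so earlier arguments take priority.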
    
  • 2020-11-22 00:08

    On my machine, using Reduce gets a 5x performance improvement:

    coalesce2 <- function(...) {
      Reduce(function(x, y) {
        i <- which(is.na(x))
        x[i] <- y[i]
        x},
      list(...))
    }
    
    > microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
    Unit: microseconds
                   expr    min       lq   median       uq     max neval
      coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
     coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100
    
  • 2020-11-22 00:12

    Another apply method, with mapply.

    mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
    [1]  1  2 NA  4  6
    

    This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.
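A minimal sketch of the tail variant mentioned above, with assumed inputs (a, b and c are not defined in this excerpt) and a hypothetical helper name, last_non_na. Unlike indexing with [1], tail() on an empty vector returns a zero-length result, so the all-NA case is handled explicitly to keep mapply's output a plain vector.

```r
# Assumed example inputs (not shown in the excerpt).
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)

# Last non-NA value per position; NA when every input is NA there.
last_non_na <- function(...) {
  temp <- c(...)
  hit <- temp[!is.na(temp)]
  if (length(hit) > 0) utils::tail(hit, 1) else NA
}

last_vals <- mapply(last_non_na, a, b, c)
print(last_vals)
```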

    Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.

    unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
                   dots=list(a, b, c), MoreArgs=NULL))
    [1]  1  2 NA  4  6
    

    .mapply differs in important ways from its non-dotted cousin:

    • it returns a list (like Map) and so must be wrapped in some function like unlist or c to return a vector.
    • the set of arguments to be fed in parallel to the function in FUN must be given in a list to the dots argument.
    • Finally, unlike in mapply, the MoreArgs argument does not have a default, so it must explicitly be fed NULL.
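The differences listed above can be checked with a short sketch. The input values are an assumption (a, b and c are not defined in this excerpt), and first_non_na is a hypothetical name for the anonymous function used in the answer.

```r
# Assumed example inputs (not shown in the excerpt).
a <- c(1, 2, NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7, 8, NA, 9, 10)

first_non_na <- function(...) {
  temp <- c(...)
  temp[!is.na(temp)][1]
}

# mapply simplifies to a vector by default.
via_mapply <- mapply(first_non_na, a, b, c)

# .mapply takes the parallel arguments as a list in `dots`, is given
# MoreArgs explicitly, and returns a list that must be flattened.
raw <- .mapply(first_non_na, dots = list(a, b, c), MoreArgs = NULL)
via_dot_mapply <- unlist(raw)

print(is.list(raw))
print(via_dot_mapply)
```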