How to implement coalesce efficiently in R

前端 未结 8 2471
深忆病人
深忆病人 2020-11-21 23:20

Background

Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can b

8条回答
  •  眼角桃花
    2020-11-21 23:58

    Looks like coalesce1 is still available

    coalesce1 <- function(...) {
        ans <- ..1
        for (elt in list(...)[-1]) {
            i <- is.na(ans)
            ans[i] <- elt[i]
        }
        ans
    }
    

    which is faster still (but more-or-less a hand re-write of Reduce, so less general)

    > identical(coalesce(a, b, c), coalesce1(a, b, c))
    [1] TRUE
    > microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
    Unit: microseconds
                   expr     min       lq   median       uq     max neval
      coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
     coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
     coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100
    

    Or for larger data compare

    coalesce1a <- function(...) {
        ans <- ..1
        for (elt in list(...)[-1]) {
            i <- which(is.na(ans))
            ans[i] <- elt[i]
        }
        ans
    }
    

    showing that which() can sometimes be effective, even though it implies a second pass through the index.

    > aa <- sample(a, 100000, TRUE)
    > bb <- sample(b, 100000, TRUE)
    > cc <- sample(c, 100000, TRUE)
    > microbenchmark(coalesce1(aa, bb, cc),
    +                coalesce1a(aa, bb, cc),
    +                coalesce2(aa,bb,cc), times=10)
    Unit: milliseconds
                       expr       min        lq    median        uq       max neval
      coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
     coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
      coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10
    

提交回复
热议问题