More elegant way to return a sequence of numbers based on booleans?

前端 未结 3 1969
爱一瞬间的悲伤
爱一瞬间的悲伤 2020-12-30 09:28

Here\'s a sample of booleans I have as part of a data.frame:

atest <- c(FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,

相关标签:
3条回答
  • 2020-12-30 09:43

    Here's one way to do it, using handy (but not widely-known/used) base functions:

    > sequence(tabulate(cumsum(!atest)))
     [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1
    

    To break it down:

    > # return/repeat integer for each FALSE
    > cumsum(!atest)
     [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3
    > # count the number of occurrences of each integer
    > tabulate(cumsum(!atest))
    [1] 10 10  1
    > # create concatenated seq_len for each integer
    > sequence(tabulate(cumsum(!atest)))
     [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1
    
    0 讨论(0)
  • 2020-12-30 09:49

    Here is another approach using other familiar functions:

    seq_along(atest) - cummax(seq_along(atest) * !atest) + 1L
    

    Because it is all vectorized, it is noticeably faster than @Joshua's solution (if speed is of any concern):

    f0 <- function(x) sequence(tabulate(cumsum(!x)))
    f1 <- function(x) {i <- seq_along(x); i - cummax(i * !x) + 1L}
    x  <- rep(atest, 10000)
    
    library(microbenchmark)
    microbenchmark(f0(x), f1(x))
    # Unit: milliseconds
    #   expr       min        lq    median        uq      max neval
    #  f0(x) 19.386581 21.853194 24.511783 26.703705 57.20482   100
    #  f1(x)  3.518581  3.976605  5.962534  7.763618 35.95388   100
    
    identical(f0(x), f1(x))
    # [1] TRUE
    
    0 讨论(0)
  • 2020-12-30 09:51

    Problems like these tend to work well with Rcpp. Borrowing @flodel's code as a framework for benchmarking,

    boolseq.cpp
    -----------
    
    #include <Rcpp.h>
    using namespace Rcpp;
    
    // [[Rcpp::export]]
    IntegerVector boolSeq(LogicalVector x) {
      int n = x.length();
      IntegerVector output = no_init(n);
      int counter = 1;
      for (int i=0; i < n; ++i) {
        if (!x[i]) {
          counter = 1;
        }
        output[i] = counter;
        ++counter;
      }
      return output;
    }
    
    /*** R
    x <- c(FALSE, sample( c(FALSE, TRUE), 1E5, TRUE ))
    
    f0 <- function(x) sequence(tabulate(cumsum(!x)))
    f1 <- function(x) {i <- seq_along(x); i - cummax(i * !x) + 1L}
    
    library(microbenchmark)
    microbenchmark(f0(x), f1(x), boolSeq(x), times=100)
    
    stopifnot(identical(f0(x), f1(x)))
    stopifnot(identical(f1(x), boolSeq(x)))
    */
    

    sourceCpping it gives me:

    Unit: microseconds
           expr       min        lq     median         uq       max neval
          f0(x) 18174.348 22163.383 24109.5820 29668.1150 78144.411   100
          f1(x)  1498.871  1603.552  2251.3610  2392.1670  2682.078   100
     boolSeq(x)   388.288   426.034   518.2875   571.4235   699.710   100
    

    Less elegant, but pretty darn close to what you were writing with R code.

    0 讨论(0)
提交回复
热议问题