More efficient strategy for which() or match()

后端 未结 3 1568
情话喂你
情话喂你 2020-12-31 23:30

I have a vector of positive and negative numbers

vec<-c(seq(-100,-1), rep(0,20), seq(1,100))

the vector is larger than the example, and

相关标签:
3条回答
  • 2020-12-31 23:56

    You could use which.min

     which.min(vec < 0) - 1L
    

    This will return the first FALSE value, i.e. the first 0.

    0 讨论(0)
  • 2021-01-01 00:03

    The solutions offered so far all imply creating a logical(length(vec)) and doing a full or partial scan on this. As you note, the vector is sorted. We can exploit this by doing a binary search. I started thinking I'd be super-clever and implement this in C for even greater speed, but had trouble with debugging the indexing of the algorithm (which is the tricky part!). So I wrote it in R:

    f3 <- function(x) {
        imin <- 1L
        imax <- length(x)
        while (imax >= imin) {
            imid <- as.integer(imin + (imax - imin) / 2)
            if (x[imid] >= 0)
                imax <- imid - 1L
            else
                imin <- imid + 1L
        }
        imax
    }
    

    For comparison with the other suggestions

    f0 <- function(v) length(which(v < 0))
    f1 <- function(v) sum(v < 0)
    f2 <- function(v) which.min(v < 0) - 1L
    

    and for fun

    library(compiler)
    f3.c <- cmpfun(f3)
    

    Leading to

    > vec <- c(seq(-100,-1,length.out=1e6), rep(0,20), seq(1,100,length.out=1e6))
    > identical(f0(vec), f1(vec))
    [1] TRUE
    > identical(f0(vec), f2(vec))
    [1] TRUE
    > identical(f0(vec), f3(vec))
    [1] TRUE
    > identical(f0(vec), f3.c(vec))
    [1] TRUE
    > microbenchmark(f0(vec), f1(vec), f2(vec), f3(vec), f3.c(vec))
    Unit: microseconds
          expr       min        lq     median         uq       max neval
       f0(vec) 15274.275 15347.870 15406.1430 15605.8470 19890.903   100
       f1(vec) 15513.807 15575.229 15651.2970 17064.8830 18326.293   100
       f2(vec) 21473.814 21558.989 21679.3210 22733.1710 27435.889   100
       f3(vec)    51.715    56.050    75.4495    78.5295   100.730   100
     f3.c(vec)    11.612    17.147    28.5570    31.3160    49.781   100
    

    Probably there are some tricky edge cases that I've got wrong! Moving to C, I did

    library(inline)
    f4 <- cfunction(c(x = "numeric"), "
        int imin = 0, imax = Rf_length(x) - 1, imid;
        while (imax >= imin) {
            imid = imin + (imax - imin) / 2;
            if (REAL(x)[imid] >= 0)
                imax = imid - 1;
            else
                imin = imid + 1;
        }
        return ScalarInteger(imax + 1);
    ")
    

    with

    > identical(f3(vec), f4(vec))
    [1] TRUE
    > microbenchmark(f3(vec), f3.c(vec), f4(vec))
    Unit: nanoseconds
          expr   min      lq  median      uq   max neval
       f3(vec) 52096 53192.0 54918.5 55539.0 69491   100
     f3.c(vec) 10924 12233.5 12869.0 13410.0 20038   100
       f4(vec)   553   796.0   893.5  1004.5  2908   100
    

    findInterval came up when a similar question was asked on the R-help list. It is slow but safe, checking that vec is actually sorted and dealing with NA values. If one wants to live on the edge (arguably no worse that implementing f3 or f4) then

    f5.i <- function(v)
        .Internal(findInterval(v, 0 - .Machine$double.neg.eps, FALSE, FALSE))
    

    is nearly as fast as the C implementation, but likely more robust and vectorized (i.e., look up a vector of values in the second argument, for easy range-like calculations).

    0 讨论(0)
  • 2021-01-01 00:14

    Use sum() and logical comparison:

    sum( vec < 0 )
    [1] 100
    

    This will be pretty quick, and when you sum a logical, TRUE is 1 and FALSE is 0 so the total will be the number of negative values.

    Uh oh, I feel the need for a benchmarking comparison... :-) Vector length is 2e5

    library(microbenchmark)
    vec<-c(seq(-100,-1,length.out=1e5), rep(0,20), seq(1,100,length.out=1e5))
    microbenchmark( (which.min(vec < 0) - 1L) , (sum( vec < 0 )) )
    
    Unit: milliseconds
                          expr      min       lq   median       uq       max neval
     (which.min(vec < 0) - 1L) 1.883847 2.130746 2.554725 3.141787 75.943911   100
                (sum(vec < 0)) 1.398100 1.500639 1.508688 1.745088  2.662164   100
    
    0 讨论(0)
提交回复
热议问题