Why sampling matrix row is very slow?

前端 未结 1 1146
终归单人心
终归单人心 2021-02-19 05:57

I tried to do some bootstrapping and calculate colMeans, naturally I chose matrix to store data, however, it is very slow in sampling:

m[sample(n,re         


        
相关标签:
1条回答
  • 2021-02-19 06:13

    Two possibilities spring to mind on first glance, both in R's function MatrixSubset on line 265.

    It might be neither of these. Just guessing.

    1. It appears to loop in a cache inefficient direction.

    for (i = 0; i < nrs; i++) {    // rows
      ...
      for (j = 0; j < ncs; j++) {  // columns
        ...
    

    Your example has a lot of columns (8,000). Each time the inner loop fetches a new column, it needs to fetch the page of RAM holding that value from RAM into cache (most likely L2). The next fetch is a different column and so it's less likely to be able to reuse a page already in L2. A matrix is internally one huge contiguous vector: all of column 1 followed by all of column 2, etc. A page fetch is relatively expensive. Going in the "wrong" direction incurs too many page fetches. More about CPU cache here.

    A good compiler should perform Loop interchange automatically such as gcc -floop-interchange which is on by default. More here. This optimization might not be happening in this case due to the complexity of what's inside the for loops; perhaps in this case the switch statements. Or perhaps the version of R you're using on your OS wasn't compiled with a compiler with that option, or wasn't turned on.

    2. The switch() is too deep

    The switch on type is happening on each and every item in the matrix. Even though a matrix is a single type! So this is wasteful. Even if the switch is being optimized with a jump table that jump table is probably still happening for every item in the matrix ('probably' because the CPU might predict the switch). Since your example matrix is tiny at 61MB, I'm leaning more towards this being the culprit rather than going in the wrong direction.

    Proposed fix for both above (untested)

    // Check the row numbers once up front rather than 8,000 times.
    // This is a contiguous sweep and therefore almost instant
    // Declare variables i and ii locally for safety and maximum compiler optimizations
    for (int i = 0; i < nrs; i++) {
      int ii = INTEGER(sr)[i];
      if (ii != NA_INTEGER && (ii < 1 || ii > nr))
        errorcall(call, R_MSG_subs_o_b);
    }
    
    // Check the column numbers up front once rather than 2,000 times
    for (int j = 0; j < ncs; j++) {
      int jj = INTEGER(sc)[j];
      if (jj != NA_INTEGER && (jj < 1 || jj > nc))
        errorcall(call, R_MSG_subs_o_b);
    }
    
    // Now switch once on type rather than 8,000 * 2,000 times
    // Loop column-by-column not row-by-row
    
    int resi=0;  // contiguous write to result (for page efficiency)
    int ii, jj;  // the current row and column, bounds checked above
    switch (TYPEOF(x)) {
      case LGLSXP:  // the INTSXP will work for LGLSXP too, currently
      case INTSXP:
        for (int j=0; j<ncs; j++) {  // column-by-column
          jj = INTEGER(sc)[j];
          for (int i=0; i<nrs; i++) {  // within-this-column
            ii = INTEGER(sr)[i];
            INTEGER(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_INTEGER : INTEGER(x)[ii + jj * nr];
          }
        }
        break;
      case REALSXP:
        for (int j=0; j<ncs; j++) {
          jj = INTEGER(sc)[j];
          for (int i=0; i<nrs; i++) {
            ii = INTEGER(sr)[i];
            REAL(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_REAL : REAL(x)[ii + jj * nr];
          }
        }
        break;
      case ...
    

    As you can see, there is more code this way because the same for loops have to be repeated over and over within the switch() cases. Code readability and robustness reasons may be why the original code is the way it is: there is less chance of a typo in R's implementation. That's already demonstrated because I was lazy in not implementing the LGLSXP case specially for LOGICAL. I know LOGICAL is exactly the same as INTEGER currently in base R. But that might change in future so my laziness (due to code bloat) could well cause a bug in R in future if LOGICAL does change (to say char rather than int for RAM efficiency).

    One possible option to solve the code bloat problem, notice that all that's really happening is moving memory around. So all the types (other than STRSXP, VECSXP and EXPRSXP) can be done with a single double-for-loop using memcpy with the type's size. SET_STRING_ELT and SET_VECTOR_ELT still must be used to maintain reference counts on those objects. So that should be just 3 repetitions of the double for loops to maintain. Alternatively, that idiom can be wrapped into a #define which is done in other parts of R.

    Finally, whether there are any NAs in the row or columns passed in (a very common case to not request the NA'th row or NA'th column!) can be detected in the first bounds checking loop. If there are no NAs then the deepest ternary ((ii == NA_INTEGER || jj == NA_INTEGER) ? :) (2000 * 8000 calls to that branch) can be saved by raising that branch outside. But with the cost of more complex repeated code. However, perhaps branch prediction would kick in reliably on all architectures and we shouldn't worry about that.

    data.table does both the memcpy trick and deep branch saving in some but not all places. It has also started to subset in parallel, column by column. But not in this case yet just because it's new and still being rolled out (setkey is very similar and is already parallel). The master thread handles the character and list columns one by one though (not in parallel) since SET_STRING_ELT and SET_VECTOR_ELT are not thread-safe in R. The other threads handle all the integer, real, complex and raw columns in parallel. It then goes as fast as memory io can go.

    I don't really see the difference that you see on 61MB but scaling up to (still small) 610MB by increasing number of columns 10x to 80,000 I do see a difference.

    n = 2000
    nc = 8000    # same size as your example (61MB), on my laptop
    microbenchmark(m[s,], DF[s,],DT[s,])
    Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
      m[s, ] 108.75182 112.11678 118.60111 114.58090 120.07952 168.6079   100
     DF[s, ] 100.95019 105.88253 116.04507 110.84693 118.08092 163.9666   100
     DT[s, ]  63.78959  69.07341  80.72039  72.69873  96.51802 136.2016   100
    
    n = 2000
    nc = 80000     # 10x bigger (610MB)
    microbenchmark(m[s,], DF[s,],DT[s,])
    Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
      m[s, ] 1990.3343 2010.1759 2055.9847 2032.9506 2057.2498 2733.278   100
     DF[s, ] 1083.0373 1212.6633 1265.5346 1234.1558 1300.7502 2105.177   100
     DT[s, ]  698.1295  830.3428  865.5918  862.5773  907.7225 1053.393   100
    

    I have 128MB of L4 cache, though. I guess you have less cache. The entire 61MB fits in my L4 cache so I don't really notice the cache inefficiency at that size.

    $ lscpu
    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Byte Order:            Little Endian
    CPU(s):                8
    On-line CPU(s) list:   0-7
    Thread(s) per core:    2
    Core(s) per socket:    4
    Socket(s):             1
    NUMA node(s):          1
    Vendor ID:             GenuineIntel
    CPU family:            6
    Model:                 70
    Model name:            Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    Stepping:              1
    CPU MHz:               3345.343
    CPU max MHz:           4000.0000
    CPU min MHz:           800.0000
    BogoMIPS:              5587.63
    Virtualization:        VT-x
    L1d cache:             32K
    L1i cache:             32K
    L2 cache:              256K
    L3 cache:              6144K
    L4 cache:              131072K
    NUMA node0 CPU(s):     0-7
    
    0 讨论(0)
提交回复
热议问题