I tried to do some bootstrapping and calculate colMeans
, naturally I chose matrix to store data, however, it is very slow in sampling:
m[sample(n,re
Two possibilities spring to mind on first glance, both in R's function MatrixSubset on line 265.
It might be neither of these. Just guessing.
for (i = 0; i < nrs; i++) { // rows
...
for (j = 0; j < ncs; j++) { // columns
...
Your example has a lot of columns (8,000). Each time the inner loop fetches a new column, it needs to fetch the page of RAM holding that value from RAM into cache (most likely L2). The next fetch is a different column and so it's less likely to be able to reuse a page already in L2. A matrix
is internally one huge contiguous vector: all of column 1 followed by all of column 2, etc. A page fetch is relatively expensive. Going in the "wrong" direction incurs too many page fetches. More about CPU cache here.
A good compiler should perform Loop interchange automatically such as gcc -floop-interchange
which is on by default. More here. This optimization might not be happening in this case due to the complexity of what's inside the for loops; perhaps in this case the switch statements. Or perhaps the version of R you're using on your OS wasn't compiled with a compiler with that option, or wasn't turned on.
The switch on type is happening on each and every item in the matrix
. Even though a matrix
is a single type! So this is wasteful. Even if the switch is being optimized with a jump table that jump table is probably still happening for every item in the matrix ('probably' because the CPU might predict the switch). Since your example matrix
is tiny at 61MB, I'm leaning more towards this being the culprit rather than going in the wrong direction.
// Check the row numbers once up front rather than 8,000 times.
// This is a contiguous sweep and therefore almost instant
// Declare variables i and ii locally for safety and maximum compiler optimizations
for (int i = 0; i < nrs; i++) {
int ii = INTEGER(sr)[i];
if (ii != NA_INTEGER && (ii < 1 || ii > nr))
errorcall(call, R_MSG_subs_o_b);
}
// Check the column numbers up front once rather than 2,000 times
for (int j = 0; j < ncs; j++) {
int jj = INTEGER(sc)[j];
if (jj != NA_INTEGER && (jj < 1 || jj > nc))
errorcall(call, R_MSG_subs_o_b);
}
// Now switch once on type rather than 8,000 * 2,000 times
// Loop column-by-column not row-by-row
int resi=0; // contiguous write to result (for page efficiency)
int ii, jj; // the current row and column, bounds checked above
switch (TYPEOF(x)) {
case LGLSXP: // the INTSXP will work for LGLSXP too, currently
case INTSXP:
for (int j=0; j<ncs; j++) { // column-by-column
jj = INTEGER(sc)[j];
for (int i=0; i<nrs; i++) { // within-this-column
ii = INTEGER(sr)[i];
INTEGER(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_INTEGER : INTEGER(x)[ii + jj * nr];
}
}
break;
case REALSXP:
for (int j=0; j<ncs; j++) {
jj = INTEGER(sc)[j];
for (int i=0; i<nrs; i++) {
ii = INTEGER(sr)[i];
REAL(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_REAL : REAL(x)[ii + jj * nr];
}
}
break;
case ...
As you can see, there is more code this way because the same for
loops have to be repeated over and over within the switch()
cases. Code readability and robustness reasons may be why the original code is the way it is: there is less chance of a typo in R's implementation. That's already demonstrated because I was lazy in not implementing the LGLSXP case specially for LOGICAL. I know LOGICAL is exactly the same as INTEGER currently in base R. But that might change in future so my laziness (due to code bloat) could well cause a bug in R in future if LOGICAL does change (to say char
rather than int
for RAM efficiency).
One possible option to solve the code bloat problem, notice that all that's really happening is moving memory around. So all the types (other than STRSXP, VECSXP and EXPRSXP) can be done with a single double-for-loop using memcpy
with the type's size. SET_STRING_ELT
and SET_VECTOR_ELT
still must be used to maintain reference counts on those objects. So that should be just 3 repetitions of the double for
loops to maintain. Alternatively, that idiom can be wrapped into a #define
which is done in other parts of R.
Finally, whether there are any NAs in the row or columns passed in (a very common case to not request the NA'th row or NA'th column!) can be detected in the first bounds checking loop. If there are no NAs then the deepest ternary ((ii == NA_INTEGER || jj == NA_INTEGER) ? :
) (2000 * 8000 calls to that branch) can be saved by raising that branch outside. But with the cost of more complex repeated code. However, perhaps branch prediction would kick in reliably on all architectures and we shouldn't worry about that.
data.table
does both the memcpy
trick and deep branch saving in some but not all places. It has also started to subset in parallel, column by column. But not in this case yet just because it's new and still being rolled out (setkey
is very similar and is already parallel). The master thread handles the character
and list
columns one by one though (not in parallel) since SET_STRING_ELT
and SET_VECTOR_ELT
are not thread-safe in R. The other threads handle all the integer, real, complex and raw columns in parallel. It then goes as fast as memory io can go.
I don't really see the difference that you see on 61MB but scaling up to (still small) 610MB by increasing number of columns 10x to 80,000 I do see a difference.
n = 2000
nc = 8000 # same size as your example (61MB), on my laptop
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
expr min lq mean median uq max neval
m[s, ] 108.75182 112.11678 118.60111 114.58090 120.07952 168.6079 100
DF[s, ] 100.95019 105.88253 116.04507 110.84693 118.08092 163.9666 100
DT[s, ] 63.78959 69.07341 80.72039 72.69873 96.51802 136.2016 100
n = 2000
nc = 80000 # 10x bigger (610MB)
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
expr min lq mean median uq max neval
m[s, ] 1990.3343 2010.1759 2055.9847 2032.9506 2057.2498 2733.278 100
DF[s, ] 1083.0373 1212.6633 1265.5346 1234.1558 1300.7502 2105.177 100
DT[s, ] 698.1295 830.3428 865.5918 862.5773 907.7225 1053.393 100
I have 128MB of L4 cache, though. I guess you have less cache. The entire 61MB fits in my L4 cache so I don't really notice the cache inefficiency at that size.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 70
Model name: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Stepping: 1
CPU MHz: 3345.343
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.63
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
L4 cache: 131072K
NUMA node0 CPU(s): 0-7