Coming from R programming, I'm in the process of expanding to compiled code in the form of C/C++ with Rcpp. As a hands-on exercise on the
"why is Cpp_rowSums() significantly faster than Cpp_colSums()?" - when fetching "row major", the CPU's prefetcher can predict what you are doing and fetch the next bunch of data you need from main memory into the CPU's cache before you need it. This speeds up your access to the data.
When you access "column major", the prefetcher has a much harder job predicting what you are going to need next, so it won't be stuffing things into cache memory ahead of time as efficiently (if at all). This slows you down.
CPUs love linear access to data. If you don't do what they love, you pay the price of cache misses and main memory access latencies.
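To make "linear access" concrete, here is a minimal C sketch of two traversal orders over the same matrix. It assumes column-major storage as used by R matrices, with element (i, j) at A[i + j * nr]; the function names are hypothetical and only for illustration. Both functions return the same total; only the memory stride differs.

```c
#include <stddef.h>

/* Walk the matrix in storage order: consecutive addresses, stride 1. */
double total_linear(const double *A, size_t nr, size_t nc) {
    double s = 0.0;
    for (size_t j = 0; j < nc; j++)
        for (size_t i = 0; i < nr; i++)
            s += A[i + j * nr];
    return s;
}

/* Walk across the storage order: each step jumps nr doubles ahead. */
double total_strided(const double *A, size_t nr, size_t nc) {
    double s = 0.0;
    for (size_t i = 0; i < nr; i++)
        for (size_t j = 0; j < nc; j++)
            s += A[i + j * nr];
    return s;
}
```

On a matrix much larger than the cache, the strided version is the one that would be expected to suffer from cache misses.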
First, let me show the timing statistics on my laptop. I use a 5000 x 5000 matrix, which is large enough for benchmarking, and the microbenchmark package with 100 evaluations.
Unit: milliseconds
expr                   min        lq      mean    median        uq       max
colSums(x)        71.40671  71.64510  71.80394  71.72543  71.80773  75.07696
Cpp_colSums(x)    71.29413  71.42409  71.65525  71.48933  71.56241  77.53056
Sugar_colSums(x)  73.05281  73.19658  73.38979  73.25619  73.31406  76.93369
Arma_colSums(x)   39.08791  39.34789  39.57979  39.43080  39.60657  41.70158
rowSums(x)       177.33477 187.37805 187.57976 187.49469 187.73155 194.32120
Cpp_rowSums(x)    54.00498  54.37984  54.70358  54.49165  54.73224  64.16104
Sugar_rowSums(x)  54.17001  54.38420  54.73654  54.56275  54.75695  61.80466
Arma_rowSums(x)   49.54407  49.77677  50.13739  49.90375  50.06791  58.29755
C code in R core is not always better than what we can write ourselves. That Cpp_rowSums is faster than rowSums shows this. I don't feel competent to explain why R's version is slower than it should be. I will focus on how we can further optimize our own colSums and rowSums to beat Armadillo. Note that I write C, use R's old C interface and compile with R CMD SHLIB.
How do colSums and rowSums differ in memory traffic?
If we have an n x n matrix that is much larger than the capacity of a CPU cache, colSums loads n x n data from RAM to cache, but rowSums loads twice as much, i.e., 2 x n x n.
Think about the resulting vector that holds the sum: how many times is this length-n vector loaded into cache from RAM? For colSums, it is loaded only once, but for rowSums, it is loaded n times. Each time you add a matrix column to it, it is loaded into cache but then evicted since it is too big.
For a large n:
colSums causes n x n + n data loads from RAM to cache;
rowSums causes n x n + n x n data loads from RAM to cache.
In other words, rowSums is in theory less memory efficient, and is likely to be slower.
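The counting argument can be read off a plain-C sketch of the two unoptimized loops. It assumes column-major storage with element (i, j) at A[i + j * nr]; naive_colsums and naive_rowsums are hypothetical names, not the functions benchmarked in this answer.

```c
#include <stddef.h>

/* colSums: one scalar accumulator per column; sum[j] is written once. */
void naive_colsums(const double *A, size_t nr, size_t nc, double *sum) {
    for (size_t j = 0; j < nc; j++) {
        double a = 0.0;
        for (size_t i = 0; i < nr; i++) a += A[i + j * nr];
        sum[j] = a;
    }
}

/* rowSums: the whole length-nr vector sum[] is swept once per column,
   so for large nr it is re-read from RAM nc times. */
void naive_rowsums(const double *A, size_t nr, size_t nc, double *sum) {
    for (size_t i = 0; i < nr; i++) sum[i] = 0.0;
    for (size_t j = 0; j < nc; j++)
        for (size_t i = 0; i < nr; i++) sum[i] += A[i + j * nr];
}
```

Both loops read the matrix in storage order; the difference is only in how often the result vector is touched.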
How to optimize colSums?
Since the data flow between RAM and cache is already optimal, the only improvement left is loop unrolling. Unrolling the inner loop (the summation loop) by a depth of 2 is sufficient and we will see close to a 2x boost.
Loop unrolling works because it lets the CPU's instruction pipeline overlap independent additions. If we just do one addition per iteration into a single accumulator, each addition has to wait for the previous one, so no pipelining is possible; with two additions into separate accumulators, instruction-level parallelism starts to work. We can also unroll the loop by a depth of 4, but in my experience depth-2 unrolling is sufficient to gain most of the benefit.
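For illustration, a depth-4 unrolled sum of a single column might look like the sketch below. The name sum_unroll4 is hypothetical, and the leftover n % 4 elements are simply added first, which is one possible way to handle them.

```c
#include <stddef.h>

double sum_unroll4(const double *x, size_t n) {
    size_t r = n % 4, i;
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0;
    for (i = 0; i < r; i++) a0 += x[i];   /* leftover elements first */
    for (i = r; i < n; i += 4) {          /* four independent accumulators */
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    return (a0 + a1) + (a2 + a3);         /* combine the accumulators */
}
```

Because the additions are reassociated, the result can differ from a strictly sequential sum in the last bits.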
How to optimize rowSums?
Optimization of data flow is the first step. We first need cache blocking to reduce the data transfer from 2 x n x n down to n x n.
Chop this n x n matrix into a number of row chunks, each being 2040 x n (the last chunk may be smaller), then apply the ordinary rowSums chunk by chunk. For each chunk, the accumulator vector has length 2040, about half of what a 32KB CPU cache can hold. The other half is reserved for the matrix column being added to this accumulator vector. In this way, the accumulator vector can be held in the cache until all matrix columns in this chunk are processed. As a result, the accumulator vector is only loaded into cache once, hence the overall memory performance is as good as that for colSums.
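A plain-C sketch of this row chunking, without the R interface and without any unrolling, could look as follows. The name blocked_rowsums and its chunk argument are illustrative, and treating a chunk of 0 as "no chunking" is an assumption of this sketch.

```c
#include <stddef.h>

void blocked_rowsums(const double *A, size_t nr, size_t nc,
                     size_t chunk, double *sum) {
    if (chunk == 0 || chunk > nr) chunk = nr;
    for (size_t i0 = 0; i0 < nr; i0 += chunk) {
        size_t len = nr - i0;
        if (len > chunk) len = chunk;       /* last chunk may be smaller */
        double *acc = sum + i0;             /* accumulator for this chunk */
        for (size_t i = 0; i < len; i++) acc[i] = 0.0;
        for (size_t j = 0; j < nc; j++) {   /* all columns of this chunk */
            const double *col = A + i0 + j * nr;
            for (size_t i = 0; i < len; i++) acc[i] += col[i];
        }
    }
}
```

The accumulator for a chunk is finished before the next chunk starts, so it only needs to stay in cache for the duration of one chunk.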
Now we can further apply loop unrolling to the rowSums in each chunk. Unrolling both the outer loop and the inner loop by a depth of 2, we will see a boost. Once the outer loop is unrolled, the chunk size should be reduced to 1360, as we now need space in the cache to hold three length-1360 vectors per outer loop iteration.
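The two chunk sizes can be checked with a little arithmetic. The helper below (a hypothetical name) assumes 8-byte doubles and returns the largest chunk length for which the given number of vectors fits in a cache of the given size.

```c
#include <stddef.h>

/* Upper bound on the chunk length when n_vectors vectors of doubles
   must share the cache; n_vectors must be nonzero. */
size_t max_chunk_rows(size_t cache_bytes, size_t n_vectors) {
    return cache_bytes / sizeof(double) / n_vectors;
}
```

For a 32KB cache (32768 bytes) this gives 2048 for two vectors and 1365 for three, so the 2040 and 1360 used here sit just below those bounds.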
Writing code with loop unrolling can be a nasty job as we now need to write several different versions of a function.
For colSums, we need two versions:
colSums_1x1: both inner and outer loops are unrolled with depth 1, i.e., this is a version without loop unrolling;
colSums_2x1: no outer loop unrolling, while the inner loop is unrolled with depth 2.
For rowSums we can have up to four versions, rowSums_sxt, where s = 1 or 2 is the unrolling depth for the inner loop and t = 1 or 2 is the unrolling depth for the outer loop.
Code writing can be very tedious if we write each version one by one. After many years of frustration with this, I developed an "automatic code / version generation" trick using inlined template functions and macros.
#include <stdlib.h>
#include <Rinternals.h>

static inline void colSums_template_sx1 (size_t s,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {
  size_t nrc = nr % s, i;
  double *A_end = A + LDA * nc, a0, a1;
  for (; A < A_end; A += LDA) {
    a0 = 0.0; a1 = 0.0;  // accumulator register variables
    if (nrc > 0) a0 = A[0];  // is there a "fractional loop"?
    for (i = nrc; i < nr; i += s) {  // main loop of depth-s
      a0 += A[i];  // 1st iteration
      if (s > 1) a1 += A[i + 1];  // 2nd iteration
    }
    if (s > 1) a0 += a1;  // combine two accumulators
    *sum++ = a0;  // write-back
  }
}

#define macro_define_colSums(s, colSums_sx1) \
SEXP colSums_sx1 (SEXP matA) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, ncols(matA))); \
  double *sum = REAL(result); \
  colSums_template_sx1(s, A, nrow_A, nrow_A, ncol_A, sum); \
  UNPROTECT(1); \
  return result; \
}

macro_define_colSums(1, colSums_1x1)
macro_define_colSums(2, colSums_2x1)
The template function computes (in R syntax) sum <- colSums(A[1:nr, 1:nc]) for a matrix A with LDA (leading dimension of A) rows. The parameter s selects the inner loop unrolling version. The template function looks horrible at first glance as it contains many if statements. However, it is declared static inline. If it is called with a known constant 1 or 2 passed to s, an optimizing compiler is able to evaluate those if conditions at compile time, eliminate unreachable code and drop "set-but-not-used" variables (register variables that are initialized and modified but never written back to RAM).
The macro is used for function definition. Accepting a constant s and a function name, it generates a function with the desired loop unrolling version.
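The same template-plus-macro pattern can be tried in isolation on a smaller problem, free of the R interface. The sketch below reduces it to summing one vector; vsum_template, DEFINE_VSUM, vsum_1 and vsum_2 are hypothetical names made up for this illustration.

```c
#include <stddef.h>

/* Template: s is expected to be the literal 1 or 2 at every call site. */
static inline double vsum_template(size_t s, const double *x, size_t n) {
    size_t r = n % s, i;
    double a0 = 0.0, a1 = 0.0;
    if (r > 0) a0 = x[0];               /* leftover element when n is odd */
    for (i = r; i < n; i += s) {
        a0 += x[i];
        if (s > 1) a1 += x[i + 1];      /* removable by the compiler for s == 1 */
    }
    if (s > 1) a0 += a1;
    return a0;
}

/* Generate one concrete function per unrolling depth. */
#define DEFINE_VSUM(s, name) \
    double name(const double *x, size_t n) { return vsum_template(s, x, n); }

DEFINE_VSUM(1, vsum_1)
DEFINE_VSUM(2, vsum_2)
```

Whether the branches on s are actually folded away depends on the compiler and the optimization level, so inspecting the generated assembly is the way to confirm it.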
The following is for rowSums.
static inline void rowSums_template_sxt (size_t s, size_t t,
                                         double *A, size_t LDA,
                                         size_t nr, size_t nc,
                                         double *sum) {
  size_t ncr = nc % t, nrr = nr % s, i;
  double *A_end = A + LDA * nc, *B;
  double a0, a1;
  for (i = 0; i < nr; i++) sum[i] = 0.0;  // necessary initialization
  if (ncr > 0) {  // is there a "fractional loop" for the outer loop?
    if (nrr > 0) sum[0] += A[0];  // is there a "fractional loop" for the inner loop?
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      sum[i] += A[i];
      if (s > 1) sum[i + 1] += A[i + 1];
    }
    A += LDA;
  }
  for (; A < A_end; A += t * LDA) {  // main outer loop with depth-t
    if (t > 1) B = A + LDA;
    if (nrr > 0) {  // is there a "fractional loop" for the inner loop?
      a0 = A[0]; if (t > 1) a0 += A[LDA];
      sum[0] += a0;
    }
    for (i = nrr; i < nr; i += s) {  // main inner loop with depth-s
      a0 = A[i]; if (t > 1) a0 += B[i];
      sum[i] += a0;
      if (s > 1) {
        a1 = A[i + 1]; if (t > 1) a1 += B[i + 1];
        sum[i + 1] += a1;
      }
    }
  }
}

#define macro_define_rowSums(s, t, rowSums_sxt) \
SEXP rowSums_sxt (SEXP matA, SEXP chunk_size) { \
  double *A = REAL(matA); \
  size_t nrow_A = (size_t)nrows(matA); \
  size_t ncol_A = (size_t)ncols(matA); \
  SEXP result = PROTECT(allocVector(REALSXP, nrows(matA))); \
  double *sum = REAL(result); \
  size_t block_size = (size_t)asInteger(chunk_size); \
  size_t i, block_size_i; \
  /* a zero chunk size would never advance the loop below */ \
  if (block_size == 0 || block_size > nrow_A) block_size = nrow_A; \
  for (i = 0; i < nrow_A; i += block_size_i) { \
    block_size_i = nrow_A - i; if (block_size_i > block_size) block_size_i = block_size; \
    rowSums_template_sxt(s, t, A, nrow_A, block_size_i, ncol_A, sum); \
    A += block_size_i; sum += block_size_i; \
  } \
  UNPROTECT(1); \
  return result; \
}

macro_define_rowSums(1, 1, rowSums_1x1)
macro_define_rowSums(1, 2, rowSums_1x2)
macro_define_rowSums(2, 1, rowSums_2x1)
macro_define_rowSums(2, 2, rowSums_2x2)
Note that the template function now accepts s and t, and the function defined by the macro applies row chunking.
Even though I've left some comments in the code, it is probably still not easy to follow, but I can't take more time to explain it in greater detail.
To use them, copy and paste them into a C file called "matSums.c" and compile it with R CMD SHLIB -c matSums.c.
For the R side, define the following functions in "matSums.R".
colSums_zheyuan <- function (A, s) {
  dyn.load("matSums.so")
  if (s == 1) result <- .Call("colSums_1x1", A)
  if (s == 2) result <- .Call("colSums_2x1", A)
  dyn.unload("matSums.so")
  result
}

rowSums_zheyuan <- function (A, chunk.size, s, t) {
  dyn.load("matSums.so")
  if (s == 1 && t == 1) result <- .Call("rowSums_1x1", A, as.integer(chunk.size))
  if (s == 2 && t == 1) result <- .Call("rowSums_2x1", A, as.integer(chunk.size))
  if (s == 1 && t == 2) result <- .Call("rowSums_1x2", A, as.integer(chunk.size))
  if (s == 2 && t == 2) result <- .Call("rowSums_2x2", A, as.integer(chunk.size))
  dyn.unload("matSums.so")
  result
}
Now let's have a benchmark, again with a 5000 x 5000 matrix.
A <- matrix(0, 5000, 5000)
library(microbenchmark)
source("matSums.R")
microbenchmark("col0" = colSums(A),
"col1" = colSums_zheyuan(A, 1),
"col2" = colSums_zheyuan(A, 2),
"row0" = rowSums(A),
"row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))
On my laptop I get:
Unit: milliseconds
expr       min        lq      mean    median        uq       max neval
col0  65.33908  71.67229  71.87273  71.80829  71.89444 111.84177   100
col1  67.16655  71.84840  72.01871  71.94065  72.05975  77.84291   100
col2  35.05374  38.98260  39.33618  39.09121  39.17615  53.52847   100
row0 159.48096 187.44225 185.53748 187.53091 187.67592 202.84827   100
row1  49.65853  54.78769  54.78313  54.92278  55.08600  60.27789   100
row2  49.42403  54.56469  55.00518  54.74746  55.06866  60.31065   100
row3  37.43314  41.57365  41.58784  41.68814  41.81774  47.12690   100
row4  34.73295  37.20092  38.51019  37.30809  37.44097  99.28327   100
Note how loop unrolling speeds up both colSums and rowSums. And with full optimization ("col2" and "row4"), we beat Armadillo (see the timing table at the beginning of this answer).
The row chunking strategy does not yield a clear benefit in this case ("row1" and "row2" are practically tied). Let's try a matrix with millions of rows.
A <- matrix(0, 1e+7, 20)
microbenchmark("row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))
I get:
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
row1 604.7202 607.0256 617.1687 607.8580 609.1728 720.1790   100
row2 514.7488 515.9874 528.9795 516.5193 521.4870 636.0051   100
row3 412.1884 413.8688 421.0790 414.8640 419.0537 525.7852   100
row4 377.7918 379.1052 390.4230 379.9344 386.4379 476.9614   100
In this case we observe the gains from cache blocking.
Basically this answer has addressed all the issues, except for the following:
R's rowSums is less efficient than it should be.
rowSums ("row1") is faster than colSums ("col1").
Again, I cannot explain the first, and actually I don't care about it since we can easily write a version that is faster than R's built-in version.
The 2nd is definitely worth pursuing. I copy in my comments from our discussion room for the record.
This issue comes down to this: "why is adding up a single vector slower than adding two vectors element-wise?"
I see a similar phenomenon from time to time. The first time I encountered this strange behavior was a few years ago, when I coded my own matrix-matrix multiplication. I found that DAXPY is faster than DDOT.
DAXPY does this: y += a * x, where x and y are vectors and a is a scalar; DDOT does this: a += x * y.
Given that DDOT is a reduction operation I expected it to be faster than DAXPY. But no, DAXPY is faster.
However, as soon as I unroll the loop in the triple loop-nest of the matrix-multiplication, DDOT is much faster than DAXPY.
A very similar thing happens to your issue. A reduction operation: a = x[1] + x[2] + ... + x[n] is slower than element-wise add: y[i] += x[i]. But once loop unrolling is done, the advantage of the latter is lost.
I am not sure whether the following explanation is true as I have no evidence.
The reduction operation has a dependency chain so the computation is strictly serial; on the other hand, the element-wise operation has no dependency chain, so the CPU may do better with it.
As soon as we unroll the loop, each iteration has more arithmetic to do and the CPU can do pipelining in both cases. The true advantage of the reduction operation can then be observed.
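The difference between the two access patterns is easier to see in code. The sketch below uses simplified, hypothetical signatures (no strides, unlike the real BLAS routines) and only illustrates where the dependency chain sits; it is not evidence for the explanation above.

```c
#include <stddef.h>

/* DAXPY-like: y[i] += a * x[i]; iterations do not depend on each other. */
void axpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* DDOT-like: every addition needs the previous value of acc. */
double dot(size_t n, const double *x, const double *y) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += x[i] * y[i];
    return acc;
}

/* DDOT with two accumulators: two shorter, independent chains. */
double dot_unroll2(size_t n, const double *x, const double *y) {
    double a0 = 0.0, a1 = 0.0;
    size_t i = 0;
    if (n % 2) { a0 = x[0] * y[0]; i = 1; }  /* leftover element */
    for (; i < n; i += 2) {
        a0 += x[i] * y[i];
        a1 += x[i + 1] * y[i + 1];
    }
    return a0 + a1;
}
```

In dot the single accumulator serializes the additions, while dot_unroll2 gives the CPU two chains it can interleave.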
rowSums2 and colSums2 from matrixStats
Still using the 5000 x 5000 example above.
A <- matrix(0, 5000, 5000)
library(microbenchmark)
source("matSums.R")
library(matrixStats) ## NEW
microbenchmark("col0" = base::colSums(A),
"col*" = matrixStats::colSums2(A), ## NEW
"col1" = colSums_zheyuan(A, 1),
"col2" = colSums_zheyuan(A, 2),
"row0" = base::rowSums(A),
"row*" = matrixStats::rowSums2(A), ## NEW
"row1" = rowSums_zheyuan(A, nrow(A), 1, 1),
"row2" = rowSums_zheyuan(A, 2040, 1, 1),
"row3" = rowSums_zheyuan(A, 1360, 1, 2),
"row4" = rowSums_zheyuan(A, 1360, 2, 2))
Unit: milliseconds
expr       min        lq      mean    median        uq       max neval
col0  71.53841  71.72628  72.13527  71.81793  71.90575  78.39645   100
col*  75.60527  75.87255  76.30752  75.98990  76.18090  87.07599   100
col1  71.67098  71.86180  72.06846  71.93872  72.03739  77.87816   100
col2  38.88565  39.03980  39.57232  39.08045  39.16790  51.39561   100
row0 187.44744 187.58121 188.98930 187.67168 187.86314 206.37662   100
row* 158.08639 158.26528 159.01561 158.34864 158.62187 174.05457   100
row1  54.62389  54.81724  54.97211  54.92394  55.04690  56.33462   100
row2  54.15409  54.44208  54.78769  54.59162  54.76073  60.92176   100
row3  41.43393  41.63886  42.57511  41.73538  41.81844 111.94846   100
row4  37.07175  37.25258  37.45033  37.34456  37.47387  43.14157   100
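I don't see a performance advantage of rowSums2 and colSums2 over the hand-written versions: colSums2 is slightly slower than base colSums, and rowSums2, although faster than base rowSums, is still far slower than "row1".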