In R, matrix multiplication is very optimized, i.e. it is really just a call to BLAS/LAPACK. However, I'm surprised this very naive C++ code for matrix-vector multiplication seems reliably faster.
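(For reference, a setup along these lines reproduces the comparison below; the exact matrix dimensions are not shown above, so the 10,000 x 10,000 size here is an assumption.)

set.seed(1)
n <- 10000L                      # assumed size; the question's exact dimensions are omitted above
m <- matrix(rnorm(n * n), n, n)  # dense square matrix
v <- rnorm(n)                    # vector to multiply with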
Josh's answer explains why R's matrix multiplication is not as fast as this naive approach. I was curious to see how much one could gain using RcppArmadillo. The code is simple enough:
library(Rcpp)

arma_code <-
  "arma::vec arma_mm(const arma::mat& m, const arma::vec& v) {
     return m * v;
   }"
arma_mm <- cppFunction(code = arma_code, depends = "RcppArmadillo")
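As a quick sanity check (using the m and v from the question), the Armadillo version should agree with base R up to floating point error:

all.equal(drop(m %*% v), drop(arma_mm(m, v)))  # expected to be TRUE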
Benchmark:
> microbenchmark::microbenchmark(my_mm(m,v), m %*% v, arma_mm(m,v), times = 10)
Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 71.23347 75.22364  90.13766  96.88279  98.07348  98.50182    10
       m %*% v 92.86398 95.58153 106.00601 111.61335 113.66167 116.09751    10
 arma_mm(m, v) 41.13348 41.42314  41.89311  41.81979  42.39311  42.78396    10
So RcppArmadillo gives us nicer syntax and better performance.
Curiosity got the better of me. Here a solution for using BLAS directly:
blas_code <- "
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  char trans = 'N';              // no transpose: compute m %*% v directly
  double one = 1.0, zero = 0.0;  // dgemv computes alpha * m %*% v + beta * ans
  int ione = 1;                  // unit stride for both vectors
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
                  &ione, &zero, ans.begin(), &ione);
  return ans;
}"
blas_mm <- cppFunction(code = blas_code, includes = "#include <R_ext/BLAS.h>")
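Again as a sanity check, the direct dgemv call should match %*% up to floating point error:

all.equal(drop(m %*% v), blas_mm(m, v))  # expected to be TRUE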
Benchmark:
Unit: milliseconds
          expr      min       lq      mean    median        uq       max neval
   my_mm(m, v) 72.61298 75.40050  89.75529  96.04413  96.59283  98.29938    10
       m %*% v 95.08793 98.53650 109.52715 111.93729 112.89662 128.69572    10
 arma_mm(m, v) 41.06718 41.70331  42.62366  42.47320  43.22625  45.19704    10
 blas_mm(m, v) 41.58618 42.14718  42.89853  42.68584  43.39182  44.46577    10
Armadillo and BLAS (OpenBLAS in my case) are almost identical, and the BLAS call is what R ends up doing as well. So roughly 2/3 of R's time goes into error checking, dispatch and the like.
To add another point to Ralf Stubner's solution, you can use the following C++ version. It processes the matrix eight columns at a time, so each element of the result is loaded and stored less often, and it uses __restrict__ to potentially allow for vector operations (which likely does not matter here, as there are only reads, I guess).
#include <Rcpp.h>
using namespace Rcpp;
inline void mat_vec_mult_vanilla
(double const * __restrict__ m,
 double const * __restrict__ v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}
inline void mat_vec_mult
(double const * __restrict__ const m,
 double const * __restrict__ const v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  // handle ncl = 8 columns per pass so that each res[i] is updated
  // eight times per load/store
  constexpr size_t const ncl(8L);
  {
    double const * mvals[ncl];            // pointers into the 8 columns of the current block
    size_t const end_j = dm - (dm % ncl), // last full block of 8 columns
                 inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  // remaining dm - j columns are handled by the plain version
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}
// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
size_t const dn = m.nrow(),
dm = m.ncol();
NumericVector res(dn);
mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
return res;
}
// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
size_t const dn = m.nrow(),
dm = m.ncol();
NumericVector res(dn);
mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
return res;
}
The result with -O3 in my Makevars file and gcc-8.3 is:
set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)
all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
#R> [1] TRUE
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))
#R> [1] TRUE
bench::mark(
  R              = m %*% lv,
  `OP's version` = my_mm(m = m, v = lv),
  `BLAS`         = blas_mm(m = m, v = lv),
  `C++ vanilla`  = mat_vec_mult_vanilla(m = m, v = lv),
  `C++`          = mat_vec_mult(m = m, v = lv),
  check = FALSE)
#R> # A tibble: 5 x 13
#R> expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
#R> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
#R> 1 R 147.9ms 151ms 6.57 78.2KB 0 4 0 609ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [4]> <tibble [4 × 3]>
#R> 2 OP's version 56.9ms 57.1ms 17.4 78.2KB 0 9 0 516ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [9]> <tibble [9 × 3]>
#R> 3 BLAS 90.1ms 90.7ms 11.0 78.2KB 0 6 0 545ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [6]> <tibble [6 × 3]>
#R> 4 C++ vanilla 57.2ms 57.4ms 17.4 78.2KB 0 9 0 518ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [9]> <tibble [9 × 3]>
#R> 5 C++ 51ms 51.4ms 19.3 78.2KB 0 10 0 519ms <NULL> <Rprofmem[,3] [2 × 3]> <bch:tm [10]> <tibble [10 × 3]>
So a slight improvement. The result may, though, be very dependent on the BLAS version. The version I used is:
sessionInfo()
#R> #...
#R> Matrix products: default
#R> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
#R> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
#R> ...
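As an aside, the -O3 mentioned above can be set in ~/.R/Makevars roughly like this (-O3 is what was used here; the -march=native line is an optional extra and an assumption on my part):

CXXFLAGS = -O3
# CXXFLAGS = -O3 -march=native  # optionally also tune for the local CPU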
The whole file I Rcpp::sourceCpp()ed is:
#include <Rcpp.h>
#include <R_ext/BLAS.h>
using namespace Rcpp;
inline void mat_vec_mult_vanilla
(double const * __restrict__ m,
 double const * __restrict__ v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  for(size_t j = 0; j < dm; ++j, ++v){
    double * r = res;
    for(size_t i = 0; i < dn; ++i, ++r, ++m)
      *r += *m * *v;
  }
}
inline void mat_vec_mult
(double const * __restrict__ const m,
 double const * __restrict__ const v,
 double * __restrict__ const res,
 size_t const dn, size_t const dm) noexcept {
  size_t j(0L);
  double const * vj = v,
               * mi = m;
  constexpr size_t const ncl(8L);
  {
    double const * mvals[ncl];
    size_t const end_j = dm - (dm % ncl),
                 inc = ncl * dn;
    for(; j < end_j; j += ncl, vj += ncl, mi += inc){
      double *r = res;
      mvals[0] = mi;
      for(size_t i = 1; i < ncl; ++i)
        mvals[i] = mvals[i - 1L] + dn;
      for(size_t i = 0; i < dn; ++i, ++r)
        for(size_t ii = 0; ii < ncl; ++ii)
          *r += *(vj + ii) * *mvals[ii]++;
    }
  }
  mat_vec_mult_vanilla(mi, vj, res, dn, dm - j);
}
// [[Rcpp::export("mat_vec_mult", rng = false)]]
NumericVector mat_vec_mult_cpp(NumericMatrix m, NumericVector v){
size_t const dn = m.nrow(),
dm = m.ncol();
NumericVector res(dn);
mat_vec_mult(&m[0], &v[0], &res[0], dn, dm);
return res;
}
// [[Rcpp::export("mat_vec_mult_vanilla", rng = false)]]
NumericVector mat_vec_mult_vanilla_cpp(NumericMatrix m, NumericVector v){
size_t const dn = m.nrow(),
dm = m.ncol();
NumericVector res(dn);
mat_vec_mult_vanilla(&m[0], &v[0], &res[0], dn, dm);
return res;
}
// [[Rcpp::export(rng = false)]]
NumericVector my_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  double v_j;
  for(int j = 0; j < nCol; j++){
    v_j = v[j];
    for(int i = 0; i < nRow; i++){
      ans[i] += m(i,j) * v_j;
    }
  }
  return(ans);
}
// [[Rcpp::export(rng = false)]]
NumericVector blas_mm(NumericMatrix m, NumericVector v){
  int nRow = m.rows();
  int nCol = m.cols();
  NumericVector ans(nRow);
  char trans = 'N';
  double one = 1.0, zero = 0.0;
  int ione = 1;
  F77_CALL(dgemv)(&trans, &nRow, &nCol, &one, m.begin(), &nRow, v.begin(),
                  &ione, &zero, ans.begin(), &ione);
  return ans;
}
/*** R
set.seed(1)
dn <- 10001L
dm <- 10001L
m <- matrix(rnorm(dn * dm), dn, dm)
lv <- rnorm(dm)
all.equal(drop(m %*% lv), mat_vec_mult(m = m, v = lv))
all.equal(drop(m %*% lv), mat_vec_mult_vanilla(m = m, v = lv))
bench::mark(
  R              = m %*% lv,
  `OP's version` = my_mm(m = m, v = lv),
  `BLAS`         = blas_mm(m = m, v = lv),
  `C++ vanilla`  = mat_vec_mult_vanilla(m = m, v = lv),
  `C++`          = mat_vec_mult(m = m, v = lv),
  check = FALSE)
*/
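Assuming the listing above is saved as, say, mat_vec_mult.cpp (the file name is arbitrary), it can be compiled and the embedded R block at the end run with:

library(Rcpp)
sourceCpp("mat_vec_mult.cpp")  # builds the exported functions and runs the /*** R ... */ section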
A quick glance in names.c (here in particular) points you to do_matprod, the C function that is called by %*% and which is found in the file array.c. (Interestingly, it turns out that both crossprod and tcrossprod dispatch to that same function as well.) Here is a link to the code of do_matprod.
Scrolling through the function, you can see that it takes care of a number of things your naive implementation does not, including dispatching to alternative S4 methods when the two objects being operated on by a call to %*% are of classes for which such methods have been provided. (That's what's happening in this portion of the function.)
Near the end of the function, it dispatches to either matprod or cmatprod. Interestingly (to me at least), in the case of real matrices, if either matrix might contain NaN or Inf values, then matprod dispatches (here) to a function called simple_matprod, which is about as simple and straightforward as your own. Otherwise, it dispatches to one of a couple of BLAS Fortran routines which, presumably, are faster if uniformly 'well-behaved' matrix elements can be guaranteed.
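If you want to see that branch in action, a rough experiment along these lines compares a clean matrix with one containing a NaN (the size is arbitrary and the timings are only illustrative):

set.seed(1)
n      <- 2000L
m2     <- matrix(rnorm(n * n), n, n)
v2     <- rnorm(n)
m2_nan <- m2
m2_nan[1L, 1L] <- NaN    # the NaN/Inf check in do_matprod should now take the simple path
microbenchmark::microbenchmark(
  clean    = m2 %*% v2,       # eligible for the BLAS routine
  with_nan = m2_nan %*% v2,   # falls back to the simple C loop
  times = 10)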