Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)

北荒 2021-02-03 14:30

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations.

3 Answers
  • 2021-02-03 15:12

    My apologies for giving an essentially C answer to a C++ question, but as has been suggested, the solution to this kind of problem generally lies in an efficient BLAS implementation. Unfortunately, BLAS itself lacks a Hadamard (elementwise) multiply, so you would have to implement your own.

    Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated (a minimal sketch follows the code below), but for most applications using R that isn't a concern. Note that this version operates "in place": it modifies X without copying it.

    // it may be necessary on your system to uncomment one of the following
    //#define restrict __restrict__ // gcc/clang
    //#define restrict __restrict   // MS Visual Studio
    //#define restrict              // remove it completely
    
    #include <Rcpp.h>
    using namespace Rcpp;
    
    #include <cstdlib>
    using std::size_t;
    
    void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                               size_t numRows, size_t numCols,
                                               const double* restrict y)
    {
      if (numRows == 0 || numCols == 0) return;
    
      for (size_t col = 0; col < numCols; ++col) {
        double* restrict x_col = x + col * numRows;
    
        for (size_t row = 0; row < numRows; ++row) {
          x_col[row] *= y[row];
        }
      }
    }
    
    // [[Rcpp::export]]
    NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                              const NumericVector& y)
    {
      // do some dimension checking here
    
      hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                            y.begin());
    
      return X;
    }
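
    As mentioned above, the worker could be templated to make it proper C++. Here is a minimal sketch of that (my addition, not part of the original answer), using the gcc/clang spelling of restrict:

    // Templated sketch of the same in-place Hadamard loop; T would normally be
    // double, but the loop structure is identical for other arithmetic types.
    #define restrict __restrict__   // gcc/clang; use __restrict for MS Visual Studio
    #include <cstddef>
    
    template <typename T>
    void hadamardMultiplyMatrixByVectorInPlaceT(T* restrict x,
                                                std::size_t numRows, std::size_t numCols,
                                                const T* restrict y)
    {
      if (numRows == 0 || numCols == 0) return;
    
      for (std::size_t col = 0; col < numCols; ++col) {
        T* restrict x_col = x + col * numRows;          // column-major: column `col` starts here
        for (std::size_t row = 0; row < numRows; ++row)
          x_col[row] *= y[row];                         // scale the row entry by y[row]
      }
    }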
    

    Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively without incurring a substantial performance hit: creating and returning a NumericMatrix(numRows, numCols) directly caused the code to run about 30% slower (a sketch of that variant follows the listing below).

    // the same `restrict` #define as in the first listing may be needed here
    #include <Rcpp.h>
    using namespace Rcpp;
    
    #include <cstdlib>
    using std::size_t;
    
    #include <R.h>
    #include <Rdefines.h>
    
    void hadamardMultiplyMatrixByVector(const double* restrict x,
                                        size_t numRows, size_t numCols,
                                        const double* restrict y,
                                        double* restrict z)
    {
      if (numRows == 0 || numCols == 0) return;
    
      for (size_t col = 0; col < numCols; ++col) {
        const double* restrict x_col = x + col * numRows;
        double* restrict z_col = z + col * numRows;
    
        for (size_t row = 0; row < numRows; ++row) {
          z_col[row] = x_col[row] * y[row];
        }
      }
    }
    
    // [[Rcpp::export]]
    SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
    {
      size_t numRows = X.nrow();
      size_t numCols = X.ncol();
    
      // do some dimension checking here
    
      SEXP Z = PROTECT(Rf_allocVector(REALSXP, (R_xlen_t) (numRows * numCols)));
      SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
      int* dims = INTEGER(dimsExpr);
      dims[0] = (int) numRows;
      dims[1] = (int) numCols;
      Rf_setAttrib(Z, R_DimSymbol, dimsExpr);
    
      hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));
    
      UNPROTECT(2);
    
      return Z;
    }
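
    For reference, the "native Rcpp" variant that the paragraph above describes as roughly 30% slower would look something like this (my sketch with a hypothetical name, not benchmarked here):

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // Sketch: let Rcpp allocate the result as a NumericMatrix instead of going
    // through the R C API directly; simpler, but the allocation was measured
    // as about 30% slower in the answer above.
    // [[Rcpp::export]]
    NumericMatrix C_matvecprod_elwise_rcpp(const NumericMatrix& X,
                                           const NumericVector& y)
    {
      int numRows = X.nrow(), numCols = X.ncol();
    
      // do some dimension checking here
    
      NumericMatrix Z(numRows, numCols);
      for (int col = 0; col < numCols; ++col)
        for (int row = 0; row < numRows; ++row)
          Z(row, col) = X(row, col) * y[row];
    
      return Z;
    }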
    

    If you're curious about the use of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C99 but not of standard C++, which is why the #define at the top of the first listing maps it to the compiler-specific extension (__restrict__ for gcc/clang, __restrict for MS Visual Studio).
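
    As a small illustration of what that contract buys (my addition, not from the answer): with the no-overlap promise the compiler may keep scale[0] in a register for the whole loop, whereas without it every store to out[i] could in principle have modified scale[0] and forced a reload.

    // Minimal sketch of the aliasing promise, using the gcc/clang spelling;
    // MS Visual Studio users would write __restrict instead.
    #define restrict __restrict__
    
    void scaleInPlace(double* restrict out, const double* restrict scale, int n)
    {
      for (int i = 0; i < n; ++i)
        out[i] *= scale[0];   // scale[0] can be hoisted: out never aliases scale
    }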

    Some R code to benchmark:

    require(rbenchmark)
    
    n <- 50000
    k <- 50
    X <- matrix(rnorm(n*k), nrow=n)
    e <- rnorm(n)
    
    R_matvecprod_elwise <- function(mat, vec) mat*vec
    
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
    X_dup <- X + 0   # force a copy so the in-place version does not clobber X
    all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))
    
    benchmark(R_matvecprod_elwise(X, e),
              C_matvecprod_elwise(X, e),
              C_matvecprod_elwise_inplace(X, e),
              columns = c("test", "replications", "elapsed", "relative"),
              order = "relative", replications = 1000)
    

    And the results:

                                   test replications elapsed relative
    3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
    2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
    1         R_matvecprod_elwise(X, e)         1000  10.670    3.217
    

    Finally, the in-place version may actually be even faster than the benchmark suggests: because the benchmark multiplies the same matrix over and over, its entries quickly overflow, and that sort of numerical mayhem can distort the timing.

    Edit:

    Removed the loop unrolling, as it provided no benefit and was otherwise distracting.

  • 2021-02-03 15:22

    If you want to speed up your calculations you will have to be a little careful about not making copies. This usually means sacrificing readability. Here is a version which makes no copies and modifies the matrix X in place (a copying variant that preserves the caller's X is sketched after the benchmark output below).

    // [[Rcpp::export]]
    NumericMatrix Rcpp_matvecprod_elwise(NumericMatrix & X, NumericVector & y){
      unsigned int ncol = X.ncol();
      unsigned int nrow = X.nrow();
      int counter = 0;
      for (unsigned int j=0; j<ncol; j++) {
        for (unsigned int i=0; i<nrow; i++)  {
          X[counter++] *= y[i];
        }
      }
      return X;
    }
    

    Here is what I get on my machine

     > library(microbenchmark)
     > microbenchmark(R=R_matvecprod_elwise(X, e), Arma=A_matvecprod_elwise(X, e),  Rcpp=Rcpp_matvecprod_elwise(X, e))
    
    Unit: milliseconds
     expr       min        lq    median       uq      max neval
        R  8.262845  9.386214 10.542599 11.53498 12.77650   100
     Arma 18.852685 19.872929 22.782958 26.35522 83.93213   100
     Rcpp  6.391219  6.640780  6.940111  7.32773  7.72021   100
    
    > all.equal(R_matvecprod_elwise(X, e), Rcpp_matvecprod_elwise(X, e))
    [1] TRUE
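
    As mentioned above, if you need to preserve the caller's X, a copying variant (my sketch, not part of the original answer) can clone the matrix once and then run the same loop on the copy:

    #include <Rcpp.h>
    using namespace Rcpp;
    
    // Sketch: pay for exactly one copy via clone(), then update that copy in place.
    // [[Rcpp::export]]
    NumericMatrix Rcpp_matvecprod_elwise_copy(const NumericMatrix & X,
                                              const NumericVector & y){
      NumericMatrix Z = clone(X);          // single allocation + copy of X
      unsigned int ncol = Z.ncol();
      unsigned int nrow = Z.nrow();
      int counter = 0;
      for (unsigned int j=0; j<ncol; j++) {
        for (unsigned int i=0; i<nrow; i++)  {
          Z[counter++] *= y[i];
        }
      }
      return Z;
    }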
    
  • 2021-02-03 15:28

    For starters, I'd write the Armadillo version (interface) as

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    
    using namespace Rcpp;
    using namespace arma;
    
    // [[Rcpp::export]]
    arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
      int k = X.n_cols ;
      arma::mat Y = repmat(y, 1, k) ;   // y replicated across the k columns
      arma::mat out = X % Y;            // % is Armadillo's elementwise (Hadamard) product
      return out;
    }
    

    as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy), but it is better style.
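
    To make that "conversion in and out" concrete (my sketch with a hypothetical name, not part of the original answer), this is roughly what the hand-rolled version looks like when the function takes and returns Rcpp types and converts to Armadillo itself:

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    
    // Sketch: the explicit as<>() and wrap() calls are the conversions in question;
    // with arma::mat / arma::vec in the signature, the generated glue code inserts
    // them for you.
    // [[Rcpp::export]]
    SEXP A_matvecprod_elwise_rcpptypes(Rcpp::NumericMatrix Xr, Rcpp::NumericVector yr) {
      arma::mat X = Rcpp::as<arma::mat>(Xr);               // conversion in
      arma::vec y = Rcpp::as<arma::vec>(yr);
      arma::mat out = X % arma::repmat(y, 1, X.n_cols);    // same Hadamard product
      return Rcpp::wrap(out);                              // conversion out
    }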

    You didn't show your benchmark results, so I can't comment on the effect of matrix size and so on. I suspect you might get better answers on rcpp-devel than here. Your pick.

    Edit: If you really want something cheap and fast, I would just do this:

    // [[Rcpp::export]]
    mat cheapHadamard(mat X, vec y) {
        // should check the row dim of X against the length of y here
        for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
        return X;
    }
    

    which allocates no new memory and will hence be faster, and probably be competitive with R (a more idiomatic each_col() variant is sketched after the test output below).

    Test output:

    R> cheapHadamard(testmat, testvec)
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    4   10   16
    [3,]    9   18   27
    R> 
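
    For completeness, a more idiomatic Armadillo spelling of the same in-place update (my sketch, not part of the original answer) uses .each_col(), which scales every column of X elementwise by y without building a repmat() copy:

    #include <RcppArmadillo.h>
    // [[Rcpp::depends(RcppArmadillo)]]
    
    // Sketch: %= on .each_col() multiplies each column of X elementwise by y.
    // [[Rcpp::export]]
    arma::mat eachColHadamard(arma::mat X, const arma::vec & y) {
        // should check the row dim of X against the length of y here
        X.each_col() %= y;
        return X;
    }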
    