Performance: Matlab vs C++ Matrix vector multiplication

北荒 2021-02-15 18:17

Preamble

Some time ago I asked a question about performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python.

2 Answers
  • 2021-02-15 18:35

    You might be interested to look at the MATLAB Central contribution mtimesx.

    Mtimesx is a MEX function that optimizes matrix multiplications using the BLAS library, OpenMP, and other methods. In my experience, when it was originally posted it could beat stock MATLAB by three orders of magnitude in some cases (somewhat embarrassing for MathWorks, I presume). These days MATLAB has improved its own methods, I suspect partly by borrowing from it, and the differences are less severe; MATLAB sometimes out-performs it.

  • 2021-02-15 18:50

    As said in the comments, MATLAB relies on Intel's MKL library for matrix products, which is one of the fastest libraries for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4) and the proper compilation flags to enable AVX/FMA if available, plus multithreading:

    -O3 -DNDEBUG -march=native
    

    Since charges_ is a vector, it is better to use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.

    If you have Intel's MKL, you can also let Eigen use it to get the exact same performance as MATLAB for this precise operation.
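Concretely, enabling the MKL backend is a matter of defining a macro before including any Eigen header and linking against MKL (a configuration sketch; the exact link line depends on your MKL install):

```cpp
// Must be defined before any Eigen header is included.
#define EIGEN_USE_MKL_ALL   // route supported Eigen kernels through Intel MKL
#include <Eigen/Dense>
// Then link with the MKL libraries, e.g. -lmkl_rt.
```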

    Regarding the assembly, it is better to interchange the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) so that the outermost loop runs in parallel, and finally you can simplify your code using Eigen:

    // Assumes #include <Eigen/Dense> and using namespace Eigen;
    void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
        kernel.resize(M, N);
        // Map the raw pointers as Eigen arrays (no copy is made)
        auto x0 = ArrayXd::Map(x, M);
        auto x1 = ArrayXd::Map(x + M, M);
        auto y0 = ArrayXd::Map(y, N);
        auto y1 = ArrayXd::Map(y + N, N);
        #pragma omp parallel for
        for (unsigned long j = 0; j < N; ++j)
            kernel.col(j).array() = sqrt((x0 - y0(j)).abs2() + (x1 - y1(j)).abs2());
    }
    

    With multi-threading you need to measure the wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20000, and the matrix-vector products take 0.24s, so 0.6s in total, which is faster than MATLAB even though my CPU seems to be slower than yours.
