Benchmarking matrix multiplication performance: C++ (eigen) is much slower than Python

一向 2021-02-03 14:54

I am trying to estimate how Python's performance compares with that of C++.

Here is my Python code:

a=np.random.rand(1000,1000) # dtype is automatically float64
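The snippet above is truncated; a minimal, self-contained version of the Python benchmark (using the names `a`, `b`, `c` from the question, and `timeit` instead of IPython's `%timeit` so it runs as a plain script) might look like:

```python
import timeit
import numpy as np

# Same setup as the question: three 1000x1000 float64 matrices.
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = np.empty((1000, 1000))

# Time c = a.b, writing into a preallocated output array so that
# no new result matrix is allocated on every run (mirrors out=c).
n_runs = 10
t = timeit.timeit(lambda: np.dot(a, b, out=c), number=n_runs) / n_runs
print(f"mean time per multiplication: {t * 1e3:.1f} ms")
```

The absolute number depends entirely on which BLAS backend NumPy is linked against, which is exactly what the answer below investigates.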


        
1 Answer
  •  太阳男子
    2021-02-03 15:28

    After long and painful installations and compilations I've performed benchmarks in Matlab, C++ and Python.

    My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (I never configured OpenCL, and it does not appear in np.show_config()).

    Software: Matlab 2018a, Python 3.6; C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2) and g++-8 (Homebrew GCC 8.2.0).

    1) Matlab performance: time = (14.3 ± 0.7) ms over 10 runs

    a=rand(1000,1000);
    b=rand(1000,1000);
    c=rand(1000,1000);
    tic
    for i=1:100
        c=a*b;
    end
    toc/100
    

    2) Python performance (%timeit a.dot(b,out=c)): 15.5 ± 0.8 ms

    I've also installed the MKL libraries for Python. With NumPy linked against MKL: 14.4 ± 0.7 ms; it helps, but only slightly.
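Whether NumPy really ended up linked against MKL (rather than, say, OpenBLAS or Accelerate) can be verified from Python itself, which is the same check mentioned for the GPU above:

```python
import numpy as np

# Print NumPy's BLAS/LAPACK build configuration. If NumPy is linked
# against MKL, the library names listed here will mention "mkl".
np.show_config()
```

This is worth running before trusting any NumPy matmul benchmark, since the backend dominates the result.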

    3) C++ performance. The following changes to the original (see the question) code were applied:

    • The noalias() method, to avoid the creation of unnecessary temporary matrices.

    • Time was measured with the C++11 chrono library.

    Here I used a bunch of different options and two different compilers:

    3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
    

    Execution time ~ 146 ms

    3.2 Added -march=native option:
    

    Execution time ~ 46 ± 2 ms

    3.3 Changed compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):
    
    gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
    

    Execution time ~ 222 ms

    3.4 Added the -march=native option:
    

    Execution time ~ 45.5 ± 1 ms

    At this point I realized that Eigen was not using multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP does not work with the latest clang version, so I had to use g++ from then on. I also verified that all available threads were actually in use by monitoring the value of Eigen::nbThreads() and by watching the macOS Activity Monitor.

    3.5  gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp
    

    Execution time: 16.5 ± 0.7 ms

    3.6 Finally, I installed the Intel MKL libraries. Using them in the code is easy: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. Linking all the libraries was the hard part, though:

    gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl   main.cpp -o my_exec_intel -O3 -fopenmp  -march=native
    

    Execution time: 14.33 ± 0.26 ms. (Editor's note: this answer originally claimed to use -DMKL_ILP64, which is not supported. Maybe it used to be, or it happened to work.)

    Conclusion:

    • Matrix-matrix multiplication in Python/Matlab is highly optimized; it is not possible (or at least very hard) to do significantly better on a CPU.

    • C++ code (at least on macOS) only reaches similar performance when fully optimized, which means the full set of optimization flags plus the Intel MKL libraries. I could have installed an older clang compiler with OpenMP support, but since the single-threaded performance is similar (~46 ms), it does not look like that would help.

    • It would be great to try this with the native Intel compiler icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.

    Thanks for useful discussion,

    Mikhail

    Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computational time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU here is that my GPU is poorly optimized for doubles. With floats, the GPU time is 0.43 ms, while Matlab's is 7.2 ms.
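The float-versus-double effect is easy to probe on the CPU side as well with NumPy (the ratio will differ from the GPU's, since CPUs handle doubles far better than a consumer GPU does, but float32 matmul is still typically faster):

```python
import timeit
import numpy as np

n = 1000
a64 = np.random.rand(n, n)      # float64 operands, as in the benchmark
b64 = np.random.rand(n, n)
a32 = a64.astype(np.float32)    # the same data in float32
b32 = b64.astype(np.float32)

# Mean time per multiplication in each precision.
t64 = timeit.timeit(lambda: a64 @ b64, number=10) / 10
t32 = timeit.timeit(lambda: a32 @ b32, number=10) / 10
print(f"float64: {t64 * 1e3:.1f} ms, float32: {t32 * 1e3:.1f} ms")
```

The exact speedup depends on the BLAS backend and CPU vector width, so treat this as a diagnostic sketch rather than a fixed ratio.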

    Edit 2: To gain a significant GPU speedup, I would need to benchmark much bigger matrices, e.g. 10k x 10k.

    Edit 3: changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.
