I am trying to estimate how good Python's performance is compared to C++.
Here is my Python code:
import numpy as np
a = np.random.rand(1000, 1000)  # dtype is float64 by default
After long and painful installations and compilations I've performed benchmarks in Matlab, C++ and Python.
My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (I never configured OpenCL and didn't see it in np.show_config()).
Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0
1) Matlab performance: time = (14.3 +- 0.7) ms over 10 runs
a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100
2) Python performance (with b and c created like a above; timed with %timeit a.dot(b,out=c)): 15.5 +- 0.8 ms
I've also installed the MKL libraries for Python. With NumPy linked against MKL: 14.4 +- 0.7 ms. It helps, but only a little.
3) C++ performance. The following changes were applied to the original code (see the question): the noalias function was used to avoid creating unnecessary temporary matrices, and time was measured with the C++11 chrono library.
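For reference, a minimal sketch of the benchmark (the exact code is in the question; the matrix sizes and run count here are assumed to match the Matlab/Python tests above):

    #include <chrono>
    #include <iostream>
    #include <Eigen/Dense>  // adjust the include to match the -I path used below

    int main() {
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd b = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd c(1000, 1000);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i)
            c.noalias() = a * b;  // noalias() skips the temporary for the product
        auto end = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> elapsed = end - start;
        std::cout << elapsed.count() / 100 << " ms per multiplication\n";
    }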
Here I used a bunch of different options and two different compilers:
3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time ~ 146 ms
3.2 Added -march=native option:
Execution time ~ 46 +-2 ms
3.3 Changed the compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):
gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time 222 ms
3.4 Added the -march=native option:
Execution time ~ 45.5 +- 1 ms
At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP does not work with the clang version above, so I had to use g++ from then on. I also made sure I was actually using all available threads by monitoring the value of Eigen::nbThreads() and by using the macOS Activity Monitor.
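A quick way to check the thread count from inside the program (a small sketch; Eigen::nbThreads() and Eigen::setNbThreads() are the relevant Eigen calls):

    #include <iostream>
    #include <Eigen/Core>

    int main() {
        // Reports more than 1 only when compiled with -fopenmp;
        // respects the OMP_NUM_THREADS environment variable.
        std::cout << "Eigen is using " << Eigen::nbThreads() << " threads\n";
        // Eigen::setNbThreads(4);  // optionally pin the count explicitly
    }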
3.5 gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp
Execution time: 16.5 +- 0.7 ms
3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. It was hard to link all the libraries though:
gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl main.cpp -o my_exec_intel -O3 -fopenmp -march=native
Execution time: 14.33 +- 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or happened to work.)
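The only source change needed for MKL is the macro; a sketch of how it sits in the file (the define must come before any Eigen header):

    // Route Eigen's dense products through MKL's BLAS/LAPACK.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>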
Conclusion:
Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or, at least, very hard) to do significantly better (on a CPU).
C++ code (at least on macOS) can only achieve similar performance if fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an old clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.
It would be great to try it with the native Intel compiler icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.
Thanks for the useful discussion,
Mikhail
Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is that my GPU has poor double-precision throughput. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms.
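The GPU timing was done along these lines (a sketch, not my exact code; the device buffers are assumed to be filled elsewhere, and the first call should be discarded as warm-up):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 1000;
        double *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(double));
        cudaMalloc(&dB, n * n * sizeof(double));
        cudaMalloc(&dC, n * n * sizeof(double));
        // ... fill dA and dB with random values (e.g. via cuRAND) ...

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        // C = alpha*A*B + beta*C, column-major as in BLAS
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("cublasDgemm: %.2f ms\n", ms);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }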
Edit 2: to gain significant GPU acceleration, I would need to benchmark matrices with much bigger sizes, e.g. 10k x 10k
Edit 3: changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.