I am trying to estimate how good Python's performance is compared to C++.
Here is my Python code:
import numpy as np
a = np.random.rand(1000, 1000)  # dtype is float64 by default
After long and painful installations and compilations I've performed benchmarks in Matlab, C++ and Python.
My computer: macOS High Sierra 10.13.6 with an Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz (4 cores, 8 threads). I have a Radeon Pro 560 4096 MB, but no GPU was involved in these tests (I never configured OpenCL and didn't see it in np.show_config()).
Software: Matlab 2018a, Python 3.6, C++ compilers: Apple LLVM version 9.1.0 (clang-902.0.39.2), g++-8 (Homebrew GCC 8.2.0) 8.2.0
1) Matlab performance: time = (14.3 +- 0.7) ms over 10 runs
a=rand(1000,1000);
b=rand(1000,1000);
c=rand(1000,1000);
tic
for i=1:100
    c=a*b;
end
toc/100
2) Python performance (with b and c created like a above; timed with %timeit a.dot(b,out=c)): 15.5 +- 0.8 ms
I've also installed the MKL libraries for Python. With NumPy linked against MKL: 14.4 +- 0.7 ms. It helps, but only a little.
3) C++ performance. The following changes were applied to the original code (see the question): the noalias function was used to avoid creating unnecessary temporary matrices, and time was measured with the C++11 chrono library.
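For reference, a minimal sketch of the benchmark (the exact code is in the question; the matrix sizes and run count here are assumed to match the Matlab/Python tests above):

    #include <chrono>
    #include <iostream>
    #include <Eigen/Dense>  // adjust the include to match the -I path used below

    int main() {
        Eigen::MatrixXd a = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd b = Eigen::MatrixXd::Random(1000, 1000);
        Eigen::MatrixXd c(1000, 1000);

        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100; ++i)
            c.noalias() = a * b;  // noalias() skips the temporary for the product
        auto end = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> elapsed = end - start;
        std::cout << elapsed.count() / 100 << " ms per multiplication\n";
    }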
Here I used a bunch of different options and two different compilers:
3.1 clang++ -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time ~ 146 ms
3.2 Added -march=native option:
Execution time ~ 46 +-2 ms
3.3 Changed the compiler to GNU g++ (on my Mac it is invoked as gpp via a custom alias):
gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3
Execution time 222 ms
3.4 Added the -march=native option:
Execution time ~ 45.5 +- 1 ms
At this point I realized that Eigen does not use multiple threads. I installed OpenMP and added the -fopenmp flag. Note that OpenMP does not work with the clang version above, so I had to use g++ from then on. I also made sure I was actually using all available threads by monitoring the value of Eigen::nbThreads() and by using the macOS Activity Monitor.
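A quick way to check the thread count from inside the program (a small sketch; Eigen::nbThreads() and Eigen::setNbThreads() are the relevant Eigen calls):

    #include <iostream>
    #include <Eigen/Core>

    int main() {
        // Reports more than 1 only when compiled with -fopenmp;
        // respects the OMP_NUM_THREADS environment variable.
        std::cout << "Eigen is using " << Eigen::nbThreads() << " threads\n";
        // Eigen::setNbThreads(4);  // optionally pin the count explicitly
    }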
3.5 gpp -std=c++11 -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen main.cpp -O3 -march=native -fopenmp
Execution time: 16.5 +- 0.7 ms
3.6 Finally, I installed the Intel MKL libraries. In the code they are quite easy to use: I just added the #define EIGEN_USE_MKL_ALL macro and that's it. It was hard to link all the libraries though:
gpp -std=c++11 -DMKL_LP64 -m64 -I${MKLROOT}/include -I/usr/local/Cellar/eigen/3.3.5/include/eigen3/eigen -L${MKLROOT}/lib -Wl,-rpath,${MKLROOT}/lib -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl main.cpp -o my_exec_intel -O3 -fopenmp -march=native
Execution time: 14.33 +- 0.26 ms. (Editor's note: this answer originally claimed to have used -DMKL_ILP64, which is not supported. Maybe it used to be, or happened to work.)
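The only source change needed for MKL is the macro; a sketch of how it sits in the file (the define must come before any Eigen header):

    // Route Eigen's dense products through MKL's BLAS/LAPACK.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>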
Conclusion:
Matrix-matrix multiplication in Python/Matlab is highly optimized. It is not possible (or, at least, very hard) to do significantly better (on a CPU).
C++ code (at least on macOS) can only achieve similar performance if fully optimized, which includes the full set of optimization options and the Intel MKL libraries. I could have installed an old clang compiler with OpenMP support, but since the single-thread performance is similar (~46 ms), it looks like this would not help.
It would be great to try it with the native Intel compiler icc. Unfortunately, it is proprietary software, unlike the Intel MKL libraries.
Thanks for the useful discussion,
Mikhail
Edit: For comparison, I've also benchmarked my GTX 980 GPU using the cublasDgemm function. Computation time = 12.6 ms, which is comparable with the other results. The reason CUDA is only about as fast as the CPU is that my GPU has poor double-precision throughput. With floats, the GPU time = 0.43 ms, while Matlab's is 7.2 ms.
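The GPU timing was done along these lines (a sketch, not my exact code; the device buffers are assumed to be filled elsewhere, and the first call should be discarded as warm-up):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 1000;
        double *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(double));
        cudaMalloc(&dB, n * n * sizeof(double));
        cudaMalloc(&dC, n * n * sizeof(double));
        // ... fill dA and dB with random values (e.g. via cuRAND) ...

        cublasHandle_t handle;
        cublasCreate(&handle);
        const double alpha = 1.0, beta = 0.0;

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        // C = alpha*A*B + beta*C, column-major as in BLAS
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("cublasDgemm: %.2f ms\n", ms);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }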
Edit 2: to gain significant GPU acceleration, I would need to benchmark matrices with much bigger sizes, e.g. 10k x 10k
Edit 3: changed the interface from MKL_ILP64 to MKL_LP64 since ILP64 is not supported.