Optimized matrix multiplication in C

2020-11-30 01:44

I'm trying to compare different methods for matrix multiplication. The first one is the normal (naive) method:

do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            for (l = 0; l < i; l++)
                MatrixR[j][k] += MatrixA[j][l] * MatrixB[l][k];
        }
    }
    c++;
} while (c < iteraciones);

13 Answers
  • 2020-11-30 02:39

    The computational complexity of multiplying two N×N matrices is O(N^3). Performance improves dramatically if you use a sub-cubic algorithm such as Strassen's (roughly O(N^2.81)), which has probably been adopted by MATLAB. If you have MATLAB installed, try multiplying two 1024×1024 matrices: on my computer MATLAB completes it in 0.7 s, while a C/C++ implementation of the naive algorithm like yours takes about 20 s. If you really care about performance, look into these lower-complexity algorithms. I have heard there are even algorithms around O(N^2.4), but they need very large matrices before the extra bookkeeping becomes negligible.
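
    To make the idea concrete (this sketch is mine, not from the answer above): Strassen's algorithm replaces the 8 multiplications of a 2×2 block product with 7. Applied recursively to half-size blocks instead of scalars, that is what gives the O(N^log2 7) ≈ O(N^2.81) bound:

    /* Strassen's 2x2 scheme: 7 multiplications instead of the usual 8.
       Recursing on half-size blocks rather than doubles yields O(N^2.81). */
    void strassen_2x2(double A[2][2], double B[2][2], double C[2][2])
    {
        double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
        double m2 = (A[1][0] + A[1][1]) *  B[0][0];
        double m3 =  A[0][0]            * (B[0][1] - B[1][1]);
        double m4 =  A[1][1]            * (B[1][0] - B[0][0]);
        double m5 = (A[0][0] + A[0][1]) *  B[1][1];
        double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
        double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);

        C[0][0] = m1 + m4 - m5 + m7;
        C[0][1] = m3 + m5;
        C[1][0] = m2 + m4;
        C[1][1] = m1 - m2 + m3 + m6;
    }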

  • 2020-11-30 02:39

    Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another N×N worth of memory. I just spent a week digging into matrix-multiplication optimization, and so far the absolute hands-down winner is this:

    #define likely(x) __builtin_expect(!!(x), 1)  /* GCC/Clang branch hint */

    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                if (likely(k))                    /* k > 0: accumulate */
                    C[i][j] += A[i][k] * B[k][j];
                else                              /* k == 0: first write to C[i][j] */
                    C[i][j] = A[i][k] * B[k][j];
    

    This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
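
    If you want to push this further, a common next step is a cache-blocked (tiled) version of the same i-k-j ordering. This is only a sketch of the idea, not part of the original answer: BS is a made-up tuning value, C is assumed zero-initialised, and N is assumed to be a multiple of BS.

    #define BS 64  /* hypothetical tile size; tune so the working set fits in cache */

    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS tile of A by one tile of B */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];
                    }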

  • 2020-11-30 02:42

    How big an improvement you get will depend on:

    1. The size of the cache
    2. The size of a cache line
    3. The degree of associativity of the cache

    For small matrix sizes and modern processors it's highly probable that the data from both MatrixA and MatrixB will be kept almost entirely in the cache after you touch it the first time.
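
    On Linux with glibc you can query these cache parameters at run time. A minimal sketch (the _SC_LEVEL1_* constants are glibc extensions and may not be available on other platforms):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* each call returns -1 if the value is unknown on this system */
        long l1_size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l1_line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        long l1_assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);

        printf("L1d cache: %ld bytes, %ld-byte lines, %ld-way associative\n",
               l1_size, l1_line, l1_assoc);
        return 0;
    }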

  • 2020-11-30 02:42

    Not so special, but better (note that this assumes MatrixB has already been transposed, so both operands are read row by row):

    c = 0;
    do
    {
        for (j = 0; j < i; j++)
        {
            for (k = 0; k < i; k++)
            {
                /* inner loop unrolled by two, with two independent
                   accumulators to shorten the dependency chain */
                sum = 0; sum_ = 0;
                for (l = 0; l < i; l += 2)
                {
                    sum  += MatrixA[j][l]     * MatrixB[k][l];
                    sum_ += MatrixA[j][l + 1] * MatrixB[k][l + 1];
                }
                MatrixR[j][k] = sum + sum_;
            }
        }
        c++;
    } while (c < iteraciones);
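
    For reference, the transposition this relies on could look like the following for a square i×i MatrixB (an illustrative helper, not part of the original answer; adjust the element type to whatever MatrixB actually uses):

    /* in-place transpose of a square i x i matrix */
    for (j = 0; j < i; j++)
        for (k = j + 1; k < i; k++)
        {
            double tmp    = MatrixB[j][k];
            MatrixB[j][k] = MatrixB[k][j];
            MatrixB[k][j] = tmp;
        }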
    
  • 2020-11-30 02:44

    Can you post some data comparing your two approaches for a range of matrix sizes? It may be that your expectations are unrealistic and that your second version is faster, but you haven't done the measurements yet.

    Don't forget, when measuring execution time, to include the time to transpose matrixB.

    Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
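
    For example, with a CBLAS implementation installed (OpenBLAS, ATLAS, Intel MKL, ...), the equivalent operation would look roughly like this. This is only a sketch; it assumes A, B and C are flat arrays of N*N doubles in row-major order (pass &A[0][0] if you use 2-D arrays), and the header/linking details depend on your BLAS.

    #include <cblas.h>  /* header provided by the BLAS implementation */

    /* C = 1.0 * A * B + 0.0 * C for row-major N x N matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,    /* A, leading dimension N */
                     B, N,    /* B, leading dimension N */
                0.0, C, N);   /* C, leading dimension N */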

  • 2020-11-30 02:45

    Very old question, but here's my current implementation for my OpenGL projects:

    typedef float matN[N][N];

    inline void matN_mul(matN dest, matN src1, matN src2)
    {
        unsigned int i;
        for (i = 0; i < N * N; i++)   /* note: N^2 would be XOR in C */
        {
            unsigned int row = i / N, col = i % N;
            dest[row][col] = src1[row][0] * src2[0][col] +
                             src1[row][1] * src2[1][col] +
                             ....
                             src1[row][N-1] * src2[N-1][col];
        }
    }
    

    Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:

    typedef float mat4[4][4];    
    
    inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
    {
        unsigned int i;
        for(i = 0; i < 16; i++)
        {
            unsigned int row = i / 4, col = i % 4;
            dest[row][col] = src1[row][0] * src2[0][col] +
                             src1[row][1] * src2[1][col] +
                             src1[row][2] * src2[2][col] +
                             src1[row][3] * src2[3][col];
        }
    }
    

    This function mainly minimizes loop overhead, but the division and modulus might be taxing (a variant without them is sketched below). On my computer this function performed roughly 50% faster than a triple-for-loop multiplication function.
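
    If the division/modulus turns out to be a problem, one possible variant (my sketch, not benchmarked here; mat4_mul_nodiv is a hypothetical name) simply uses two explicit counters instead:

    /* same unrolled dot product, but row/col come from plain loops */
    inline void mat4_mul_nodiv(mat4 dest, mat4 src1, mat4 src2)
    {
        unsigned int row, col;
        for (row = 0; row < 4; row++)
            for (col = 0; col < 4; col++)
                dest[row][col] = src1[row][0] * src2[0][col] +
                                 src1[row][1] * src2[1][col] +
                                 src1[row][2] * src2[2][col] +
                                 src1[row][3] * src2[3][col];
    }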

    Cons:

    • Lots of code needed (ex. different functions for mat3 x mat3, mat5 x mat5...)

    • Tweaks needed for irregular multiplication (ex. mat3 x mat4).....
