Calculating matrix product is much slower with SSE than with the straightforward algorithm

滥情空心 2021-01-03 05:40

I want to multiply two matrices, once using the straightforward algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T ** C, int sizeX)

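Only the start of that function is shown above; a minimal sketch of a straightforward (triple-loop) multiply, written here for reference rather than taken from the question, looks like this:

//Plain triple-loop C = A*B for square sizeX x sizeX matrices.
//Reference sketch only; the asker's original body is not shown above.
template <typename T>
void multiplicate_straight_sketch(T ** A, T ** B, T ** C, int sizeX)
{
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            T sum = 0;
            for(int g = 0; g < sizeX; g++)
                sum += A[i][g] * B[g][j];  //row of A dot column of B
            C[i][j] = sum;
        }
    }
}
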
2 Answers
  • 2021-01-03 06:09

    You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

    Let's stick to float (SGEMM). What you want to do with SSE is do four dot products at once. You want C = A*B. Let's look at an 8x8 matrix. Let's assume B is:

    (0   1  2  3) ( 4  5  6  7)
    (8   9 10 11) (12 13 14 15) 
    (16 17 18 19) (20 21 22 23)
    (24 25 26 27) (28 29 30 31)
    (32 33 34 35) (36 37 38 39)
    (40 41 42 43) (44 45 46 47)
    (48 49 50 51) (52 53 54 55)
    (56 57 58 59) (60 61 62 63)
    

    So with SSE you do:

    C[0][0] C[0][1] C[0][2] C[0][3] = 
    A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)
    

    That gets you four dot products at once. The problem is that you have to move down a column in B and the values are not in the same cache line. It would be better if each column of width four was contiguous in memory. So instead of taking the transpose of each element you transpose strips with a width of 4 like this:

    (0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
    (4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)
    

    If you think of each of the four values in parentheses as one unit, this is equivalent to transposing an 8x2 matrix to a 2x8 matrix. Notice now that the columns of width four of B are contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue, but for example with a 1024x1024 matrix it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for an 8x8 matrix). For double precision the strip width is two with SSE and four with AVX.

    This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

    However, if you use loop tiling and rearrange the matrix in tiles which fit in the L2 cache, rather than operating on the whole matrix at once, matrix multiplication becomes compute bound instead of memory bound, even for very large matrices that don't fit in the L3 cache. That's another topic.
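    As a rough illustration of that idea (a sketch of the loop-tiling part only, without the data rearrangement also mentioned above; the block size and flat row-major layout are assumptions, not tuned values):

    //Loop tiling (blocking): compute C = A*B in BS x BS tiles so the working
    //set stays in cache.  Assumes row-major n x n matrices and C zeroed.
    void sgemm_tiled(const float *A, const float *B, float *C, const int n) {
        const int BS = 64;  //example block size, tune for the L2 cache
        for(int ii = 0; ii < n; ii += BS) {
            for(int kk = 0; kk < n; kk += BS) {
                for(int jj = 0; jj < n; jj += BS) {
                    //multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                    for(int i = ii; i < ii + BS && i < n; i++) {
                        for(int k = kk; k < kk + BS && k < n; k++) {
                            float a = A[i*n + k];
                            for(int j = jj; j < jj + BS && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
                    }
                }
            }
        }
    }
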

    Edit: some (untested) code to compare to your scalar code. I unrolled the loop by 2.

    void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width);

    //float4 stands for a 4-wide SIMD wrapper class (e.g. Vec4f from a vector
    //class library) with load/store and broadcast via operator overloads.
    void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
        const int simd_width = 4;
        const int unroll = 2;
        const int strip_width = simd_width*unroll;
        float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
        transpose_matrix_strip(B, D, sizeX, strip_width); //transpose B in strips of width eight
        for(int i = 0; i < sizeX; i++) {
            for(int j = 0; j < sizeX; j+=strip_width) {
                float4 out_v1 = 0; //broadcast (0,0,0,0)
                float4 out_v2 = 0;
                //now calculate eight dot products
                for(int g = 0; g < sizeX; g++) {
                    //load eight values from D into two SSE registers
                    float4 vec4_1, vec4_2;
                    vec4_1.load(&D[j*sizeX + strip_width*g]);
                    vec4_2.load(&D[j*sizeX + strip_width*g + simd_width]);
                    out_v1 += A[i*sizeX + g]*vec4_1;
                    out_v2 += A[i*sizeX + g]*vec4_2;
                }
                //store eight dot products into C
                out_v1.store(&C[i*sizeX + j]);
                out_v2.store(&C[i*sizeX + j + simd_width]);
            }
        }
        _mm_free(D);
    }
    
    void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
        //reorder A so that each strip of strip_width columns is stored row by
        //row, i.e. A[i][k..k+strip_width-1] ends up contiguous in B
        //#pragma omp parallel for
        for(int n=0; n<N*N; n++) {
            int k = strip_width*(n/N/strip_width);
            int i = (n/strip_width)%N;
            int j = n%strip_width;
            B[n] = A[N*i + k + j];
        }
    }

    Notice that j increments by eight now. More unrolling may help. If you want to use intrinsics you can use _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcasts, e.g. _mm_set1_ps(A[i*sizeX + g])), _mm_add_ps, and _mm_mul_ps. That's it.
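    For reference, the same kernel written with those intrinsics might look roughly like this (a sketch under the same assumptions as above: it takes the already transposed D for brevity, D holds B in strips of width eight, and all pointers are 16-byte aligned):

    #include <xmmintrin.h> //SSE intrinsics

    //Same loop as SGEMM_SSE above, but with raw intrinsics instead of float4.
    void SGEMM_SSE_intrinsics(const float *A, const float *D, float *C, const int sizeX) {
        for(int i = 0; i < sizeX; i++) {
            for(int j = 0; j < sizeX; j += 8) {
                __m128 out_v1 = _mm_setzero_ps();
                __m128 out_v2 = _mm_setzero_ps();
                for(int g = 0; g < sizeX; g++) {
                    __m128 a    = _mm_set1_ps(A[i*sizeX + g]);        //broadcast A(i,g)
                    __m128 b_lo = _mm_load_ps(&D[j*sizeX + 8*g]);     //B(g, j..j+3)
                    __m128 b_hi = _mm_load_ps(&D[j*sizeX + 8*g + 4]); //B(g, j+4..j+7)
                    out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a, b_lo));
                    out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a, b_hi));
                }
                _mm_store_ps(&C[i*sizeX + j], out_v1);
                _mm_store_ps(&C[i*sizeX + j + 4], out_v2);
            }
        }
    }
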

  • 2021-01-03 06:16

    I believe this should do the same thing as the first loop with SSE (for T = double, since it uses the double-precision intrinsics), assuming sizeX is a multiple of two and the memory is 16-byte aligned.

    You may gain a bit more performance by unrolling the loop and using multiple temporary variables which you add together at the end. You could also try AVX and the newer fused multiply-add (FMA) instructions; a rough sketch of that follows the code below.

    template <typename T>
    void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)
    {
        T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
        transpose_matrix(B, D,sizeX);
        for(int i = 0; i < sizeX; i++)
        {
            for(int j = 0; j < sizeX; j++)
            {
                __m128d temp = _mm_setzero_pd();  //two partial sums
                for(int g = 0; g < sizeX; g += 2)
                {
                    __m128d a = _mm_load_pd(&A[i][g]);  //A[i][g], A[i][g+1]
                    __m128d b = _mm_load_pd(&D[j][g]);  //B[g][j], B[g+1][j] (D is B transposed)
                    temp = _mm_add_pd(temp, _mm_mul_pd(a,b));
                }
                // Add top and bottom half of temp together
                temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
                _mm_store_sd(&C[i][j], temp); // Store one value
            }
        }
        FreeDynamicArray2D<T>(D);
    }
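
    A sketch of the unrolling and AVX/FMA suggestion above (an illustration, not part of the original code; it takes the already transposed matrix D for brevity, and needs sizeX to be a multiple of eight and 32-byte aligned rows):

    #include <immintrin.h> //AVX and FMA intrinsics

    //Same dot-product kernel, but four doubles per step with AVX, two
    //accumulators (added together at the end) and fused multiply-add.
    void multiplicate_AVX_FMA(double ** A, double ** D, double ** C, int sizeX)
    {
        for(int i = 0; i < sizeX; i++)
        {
            for(int j = 0; j < sizeX; j++)
            {
                __m256d acc0 = _mm256_setzero_pd();
                __m256d acc1 = _mm256_setzero_pd();
                for(int g = 0; g < sizeX; g += 8)
                {
                    acc0 = _mm256_fmadd_pd(_mm256_load_pd(&A[i][g]),
                                           _mm256_load_pd(&D[j][g]), acc0);
                    acc1 = _mm256_fmadd_pd(_mm256_load_pd(&A[i][g + 4]),
                                           _mm256_load_pd(&D[j][g + 4]), acc1);
                }
                //combine the two accumulators and reduce the four lanes to one value
                __m256d acc = _mm256_add_pd(acc0, acc1);
                __m128d lo  = _mm256_castpd256_pd128(acc);
                __m128d hi  = _mm256_extractf128_pd(acc, 1);
                __m128d sum = _mm_add_pd(lo, hi);
                sum = _mm_add_pd(sum, _mm_unpackhi_pd(sum, sum));
                _mm_store_sd(&C[i][j], sum);
            }
        }
    }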
    