Calculating matrix product is much slower with SSE than with the straightforward algorithm

滥情空心 2021-01-03 05:40

I want to multiply two matrices, once using the straightforward algorithm:

template <typename T>
void multiplicate_straight(T ** A, T ** B, T ** C, int sizeX)

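Only the start of that function is shown above; a minimal sketch of a straightforward (triple-loop) multiply, written here for reference rather than taken from the question, looks like this:

//Plain triple-loop C = A*B for square sizeX x sizeX matrices.
//Reference sketch only; the asker's original body is not shown above.
template <typename T>
void multiplicate_straight_sketch(T ** A, T ** B, T ** C, int sizeX)
{
    for(int i = 0; i < sizeX; i++)
    {
        for(int j = 0; j < sizeX; j++)
        {
            T sum = 0;
            for(int g = 0; g < sizeX; g++)
                sum += A[i][g] * B[g][j];  //row of A dot column of B
            C[i][j] = sum;
        }
    }
}
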
2 Answers
  • 2021-01-03 06:09

    You have the right idea by taking the transpose in the scalar code but you don't want exactly the transpose when using SSE.

    Let's stick to float (SGEMM). What you want to do with SSE is do four dot products at once. You want C = A*B. Let's look at an 8x8 matrix. Let's assume B is:

    (0   1  2  3) ( 4  5  6  7)
    (8   9 10 11) (12 13 14 15) 
    (16 17 18 19) (20 21 22 23)
    (24 25 26 27) (28 29 30 31)
    (32 33 34 35) (36 37 38 39)
    (40 41 42 43) (44 45 46 47)
    (48 49 50 51) (52 53 54 55)
    (56 57 58 59) (60 61 62 63)
    

    So with SSE you do:

    C[0][0] C[0][1] C[0][2] C[0][3] = 
    A[0][0]*(0 1 2 3) + A[0][1]*(8 9 10 11) + A[0][2]*(16 17 18 19)...+ A[0][7]*(56 57 58 59)
    

    That gets you four dot products at once. The problem is that you have to move down a column in B and the values are not in the same cache line. It would be better if each column of width four was contiguous in memory. So instead of taking the transpose of each element you transpose strips with a width of 4 like this:

    (0  1  2  3)( 8  9 10 11)(16 17 18 19)(24 25 26 27)(32 33 34 35)(40 41 42 43)(48 49 50 51)(56 57 58 59)
    (4  5  6  7)(12 13 14 15)(20 21 22 23)(28 29 30 31)(36 37 38 39)(44 45 46 47)(52 53 54 55)(60 61 62 63)
    

    If you think of each of the four values in parentheses as one unit, this is equivalent to transposing an 8x2 matrix to a 2x8 matrix. Notice now that the columns of width four of B are contiguous in memory. This is far more cache friendly. For an 8x8 matrix this is not really an issue, but for example with a 1024x1024 matrix it is. See the code below for how to do this. For AVX you transpose strips of width 8 (which means you have nothing to do for an 8x8 matrix). For double precision the strip width is two with SSE and four with AVX.

    This should be four times faster than the scalar code assuming the matrices fit in the cache. However, for large matrices this method is still going to be memory bound and so your SSE code may not be much faster than scalar code (but it should not be worse).

    However, if you use loop tiling and rearrange the matrix in tiles which fit in the L2 cache, rather than operating on the whole matrix at once, matrix multiplication becomes compute bound instead of memory bound, even for very large matrices that don't fit in the L3 cache. That's another topic.
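    As a rough illustration of that idea (a sketch of the loop-tiling part only, without the data rearrangement also mentioned above; the block size and flat row-major layout are assumptions, not tuned values):

    //Loop tiling (blocking): compute C = A*B in BS x BS tiles so the working
    //set stays in cache.  Assumes row-major n x n matrices and C zeroed.
    void sgemm_tiled(const float *A, const float *B, float *C, const int n) {
        const int BS = 64;  //example block size, tune for the L2 cache
        for(int ii = 0; ii < n; ii += BS) {
            for(int kk = 0; kk < n; kk += BS) {
                for(int jj = 0; jj < n; jj += BS) {
                    //multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                    for(int i = ii; i < ii + BS && i < n; i++) {
                        for(int k = kk; k < kk + BS && k < n; k++) {
                            float a = A[i*n + k];
                            for(int j = jj; j < jj + BS && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
                    }
                }
            }
        }
    }
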

    Edit: some (untested) code to compare to your scalar code. I unrolled the loop by 2.

    void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width);

    //float4 stands for a 4-wide SIMD wrapper class (e.g. Vec4f from a vector
    //class library) with load/store and broadcast via operator overloads.
    void SGEMM_SSE(const float *A, const float *B, float *C, const int sizeX) {
        const int simd_width = 4;
        const int unroll = 2;
        const int strip_width = simd_width*unroll;
        float *D = (float*)_mm_malloc(sizeof(float)*sizeX*sizeX, 16);
        transpose_matrix_strip(B, D, sizeX, strip_width); //transpose B in strips of width eight
        for(int i = 0; i < sizeX; i++) {
            for(int j = 0; j < sizeX; j+=strip_width) {
                float4 out_v1 = 0; //broadcast (0,0,0,0)
                float4 out_v2 = 0;
                //now calculate eight dot products
                for(int g = 0; g < sizeX; g++) {
                    //load eight values from D into two SSE registers
                    float4 vec4_1, vec4_2;
                    vec4_1.load(&D[j*sizeX + strip_width*g]);
                    vec4_2.load(&D[j*sizeX + strip_width*g + simd_width]);
                    out_v1 += A[i*sizeX + g]*vec4_1;
                    out_v2 += A[i*sizeX + g]*vec4_2;
                }
                //store eight dot products into C
                out_v1.store(&C[i*sizeX + j]);
                out_v2.store(&C[i*sizeX + j + simd_width]);
            }
        }
        _mm_free(D);
    }
    
    void transpose_matrix_strip(const float* A, float* B, const int N, const int strip_width) {
        //reorder A so that each strip of strip_width columns is stored row by
        //row, i.e. A[i][k..k+strip_width-1] ends up contiguous in B
        //#pragma omp parallel for
        for(int n=0; n<N*N; n++) {
            int k = strip_width*(n/N/strip_width);
            int i = (n/strip_width)%N;
            int j = n%strip_width;
            B[n] = A[N*i + k + j];
        }
    }

    Notice that j increments by eight now. More unrolling may help. If you want to use intrinsics you can use _mm_load_ps, _mm_store_ps, _mm_set1_ps (for the broadcasts, e.g. _mm_set1_ps(A[i*sizeX + g])), _mm_add_ps, and _mm_mul_ps. That's it.
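    For reference, the same kernel written with those intrinsics might look roughly like this (a sketch under the same assumptions as above: it takes the already transposed D for brevity, D holds B in strips of width eight, and all pointers are 16-byte aligned):

    #include <xmmintrin.h> //SSE intrinsics

    //Same loop as SGEMM_SSE above, but with raw intrinsics instead of float4.
    void SGEMM_SSE_intrinsics(const float *A, const float *D, float *C, const int sizeX) {
        for(int i = 0; i < sizeX; i++) {
            for(int j = 0; j < sizeX; j += 8) {
                __m128 out_v1 = _mm_setzero_ps();
                __m128 out_v2 = _mm_setzero_ps();
                for(int g = 0; g < sizeX; g++) {
                    __m128 a    = _mm_set1_ps(A[i*sizeX + g]);        //broadcast A(i,g)
                    __m128 b_lo = _mm_load_ps(&D[j*sizeX + 8*g]);     //B(g, j..j+3)
                    __m128 b_hi = _mm_load_ps(&D[j*sizeX + 8*g + 4]); //B(g, j+4..j+7)
                    out_v1 = _mm_add_ps(out_v1, _mm_mul_ps(a, b_lo));
                    out_v2 = _mm_add_ps(out_v2, _mm_mul_ps(a, b_hi));
                }
                _mm_store_ps(&C[i*sizeX + j], out_v1);
                _mm_store_ps(&C[i*sizeX + j + 4], out_v2);
            }
        }
    }
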

  • 2021-01-03 06:16

    I believe this should do the same thing as the first loop with SSE (for T = double, since it uses the double-precision intrinsics), assuming sizeX is a multiple of two and the memory is 16-byte aligned.

    You may gain a bit more performance by unrolling the loop and using multiple temporary variables which you add together at the end. You could also try AVX and the newer fused multiply-add (FMA) instructions; a rough sketch of that follows the code below.

    template <typename T>
    void multiplicate_SSE2(T ** A, T ** B, T ** C, int sizeX)
    {
        T ** D = AllocateDynamicArray2D<T>(sizeX, sizeX);
        transpose_matrix(B, D,sizeX);
        for(int i = 0; i < sizeX; i++)
        {
            for(int j = 0; j < sizeX; j++)
            {
                __m128d temp = _mm_setzero_pd();  //two partial sums
                for(int g = 0; g < sizeX; g += 2)
                {
                    __m128d a = _mm_load_pd(&A[i][g]);  //A[i][g], A[i][g+1]
                    __m128d b = _mm_load_pd(&D[j][g]);  //B[g][j], B[g+1][j] (D is B transposed)
                    temp = _mm_add_pd(temp, _mm_mul_pd(a,b));
                }
                // Add top and bottom half of temp together
                temp = _mm_add_pd(temp, _mm_shuffle_pd(temp, temp, 1));
                _mm_store_sd(&C[i][j], temp); // Store one value
            }
        }
        FreeDynamicArray2D<T>(D);
    }
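
    A sketch of the unrolling and AVX/FMA suggestion above (an illustration, not part of the original code; it takes the already transposed matrix D for brevity, and needs sizeX to be a multiple of eight and 32-byte aligned rows):

    #include <immintrin.h> //AVX and FMA intrinsics

    //Same dot-product kernel, but four doubles per step with AVX, two
    //accumulators (added together at the end) and fused multiply-add.
    void multiplicate_AVX_FMA(double ** A, double ** D, double ** C, int sizeX)
    {
        for(int i = 0; i < sizeX; i++)
        {
            for(int j = 0; j < sizeX; j++)
            {
                __m256d acc0 = _mm256_setzero_pd();
                __m256d acc1 = _mm256_setzero_pd();
                for(int g = 0; g < sizeX; g += 8)
                {
                    acc0 = _mm256_fmadd_pd(_mm256_load_pd(&A[i][g]),
                                           _mm256_load_pd(&D[j][g]), acc0);
                    acc1 = _mm256_fmadd_pd(_mm256_load_pd(&A[i][g + 4]),
                                           _mm256_load_pd(&D[j][g + 4]), acc1);
                }
                //combine the two accumulators and reduce the four lanes to one value
                __m256d acc = _mm256_add_pd(acc0, acc1);
                __m128d lo  = _mm256_castpd256_pd128(acc);
                __m128d hi  = _mm256_extractf128_pd(acc, 1);
                __m128d sum = _mm_add_pd(lo, hi);
                sum = _mm_add_pd(sum, _mm_unpackhi_pd(sum, sum));
                _mm_store_sd(&C[i][j], sum);
            }
        }
    }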
    