Optimized matrix multiplication in C

2020-11-30 01:44

I'm trying to compare different methods for matrix multiplication. The first one is the normal (naive) method:

do
{
    for (j = 0; j < i; j++)
    {
        for (k = 0; k < i; k++)
        {
            for (l = 0; l < i; l++)
                MatrixR[j][k] += MatrixA[j][l] * MatrixB[l][k];
        }
    }
    c++;
} while (c < iteraciones);

13 Answers
  • 2020-11-30 02:39

    The computational complexity of multiplying two N×N matrices is O(N^3). Performance improves dramatically if you use a sub-cubic algorithm such as Strassen's (roughly O(N^2.81)), which has probably been adopted by MATLAB. If you have MATLAB installed, try multiplying two 1024×1024 matrices: on my computer MATLAB completes it in 0.7 s, while a C/C++ implementation of the naive algorithm like yours takes about 20 s. If you really care about performance, look into these lower-complexity algorithms. I have heard there are even algorithms around O(N^2.4), but they need very large matrices before the extra bookkeeping becomes negligible.
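
    To make the idea concrete (this sketch is mine, not from the answer above): Strassen's algorithm replaces the 8 multiplications of a 2×2 block product with 7. Applied recursively to half-size blocks instead of scalars, that is what gives the O(N^log2 7) ≈ O(N^2.81) bound:

    /* Strassen's 2x2 scheme: 7 multiplications instead of the usual 8.
       Recursing on half-size blocks rather than doubles yields O(N^2.81). */
    void strassen_2x2(double A[2][2], double B[2][2], double C[2][2])
    {
        double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
        double m2 = (A[1][0] + A[1][1]) *  B[0][0];
        double m3 =  A[0][0]            * (B[0][1] - B[1][1]);
        double m4 =  A[1][1]            * (B[1][0] - B[0][0]);
        double m5 = (A[0][0] + A[0][1]) *  B[1][1];
        double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
        double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);

        C[0][0] = m1 + m4 - m5 + m7;
        C[0][1] = m3 + m5;
        C[1][0] = m2 + m4;
        C[1][1] = m1 - m2 + m3 + m6;
    }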

  • 2020-11-30 02:39

    Generally speaking, transposing B should end up being much faster than the naive implementation, but at the expense of wasting another N×N worth of memory. I just spent a week digging into matrix-multiplication optimization, and so far the absolute hands-down winner is this:

    #define likely(x) __builtin_expect(!!(x), 1)  /* GCC/Clang branch hint */

    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                if (likely(k))                    /* k > 0: accumulate */
                    C[i][j] += A[i][k] * B[k][j];
                else                              /* k == 0: first write to C[i][j] */
                    C[i][j] = A[i][k] * B[k][j];
    

    This is even better than Drepper's method mentioned in an earlier comment, as it works optimally regardless of the cache properties of the underlying CPU. The trick lies in reordering the loops so that all three matrices are accessed in row-major order.
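
    If you want to push this further, a common next step is a cache-blocked (tiled) version of the same i-k-j ordering. This is only a sketch of the idea, not part of the original answer: BS is a made-up tuning value, C is assumed zero-initialised, and N is assumed to be a multiple of BS.

    #define BS 64  /* hypothetical tile size; tune so the working set fits in cache */

    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS tile of A by one tile of B */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += a * B[k][j];
                    }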

  • 2020-11-30 02:42

    How big an improvement you get will depend on:

    1. The size of the cache
    2. The size of a cache line
    3. The degree of associativity of the cache

    For small matrix sizes and modern processors it's highly probable that the data from both MatrixA and MatrixB will be kept almost entirely in the cache after you touch it the first time.
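
    On Linux with glibc you can query these cache parameters at run time. A minimal sketch (the _SC_LEVEL1_* constants are glibc extensions and may not be available on other platforms):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* each call returns -1 if the value is unknown on this system */
        long l1_size  = sysconf(_SC_LEVEL1_DCACHE_SIZE);
        long l1_line  = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        long l1_assoc = sysconf(_SC_LEVEL1_DCACHE_ASSOC);

        printf("L1d cache: %ld bytes, %ld-byte lines, %ld-way associative\n",
               l1_size, l1_line, l1_assoc);
        return 0;
    }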

  • 2020-11-30 02:42

    Not so special, but better (note that this assumes MatrixB has already been transposed, so both operands are read row by row):

    c = 0;
    do
    {
        for (j = 0; j < i; j++)
        {
            for (k = 0; k < i; k++)
            {
                /* inner loop unrolled by two, with two independent
                   accumulators to shorten the dependency chain */
                sum = 0; sum_ = 0;
                for (l = 0; l < i; l += 2)
                {
                    sum  += MatrixA[j][l]     * MatrixB[k][l];
                    sum_ += MatrixA[j][l + 1] * MatrixB[k][l + 1];
                }
                MatrixR[j][k] = sum + sum_;
            }
        }
        c++;
    } while (c < iteraciones);
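
    For reference, the transposition this relies on could look like the following for a square i×i MatrixB (an illustrative helper, not part of the original answer; adjust the element type to whatever MatrixB actually uses):

    /* in-place transpose of a square i x i matrix */
    for (j = 0; j < i; j++)
        for (k = j + 1; k < i; k++)
        {
            double tmp    = MatrixB[j][k];
            MatrixB[j][k] = MatrixB[k][j];
            MatrixB[k][j] = tmp;
        }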
    
  • 2020-11-30 02:44

    Can you post some data comparing your two approaches for a range of matrix sizes? It may be that your expectations are unrealistic and that your second version is faster, but you haven't done the measurements yet.

    Don't forget, when measuring execution time, to include the time to transpose matrixB.

    Something else you might want to try is comparing the performance of your code with that of the equivalent operation from your BLAS library. This may not answer your question directly, but it will give you a better idea of what you might expect from your code.
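
    For example, with a CBLAS implementation installed (OpenBLAS, ATLAS, Intel MKL, ...), the equivalent operation would look roughly like this. This is only a sketch; it assumes A, B and C are flat arrays of N*N doubles in row-major order (pass &A[0][0] if you use 2-D arrays), and the header/linking details depend on your BLAS.

    #include <cblas.h>  /* header provided by the BLAS implementation */

    /* C = 1.0 * A * B + 0.0 * C for row-major N x N matrices */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0, A, N,    /* A, leading dimension N */
                     B, N,    /* B, leading dimension N */
                0.0, C, N);   /* C, leading dimension N */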

  • 2020-11-30 02:45

    Very old question, but here's my current implementation for my OpenGL projects:

    typedef float matN[N][N];

    inline void matN_mul(matN dest, matN src1, matN src2)
    {
        unsigned int i;
        for (i = 0; i < N * N; i++)   /* note: N^2 would be XOR in C */
        {
            unsigned int row = i / N, col = i % N;
            dest[row][col] = src1[row][0] * src2[0][col] +
                             src1[row][1] * src2[1][col] +
                             ....
                             src1[row][N-1] * src2[N-1][col];
        }
    }
    

    Where N is replaced with the size of the matrix. So if you are multiplying 4x4 matrices, then you use:

    typedef float mat4[4][4];    
    
    inline void mat4_mul(mat4 dest, mat4 src1, mat4 src2)
    {
        unsigned int i;
        for(i = 0; i < 16; i++)
        {
            unsigned int row = i / 4, col = i % 4;
            dest[row][col] = src1[row][0] * src2[0][col] +
                             src1[row][1] * src2[1][col] +
                             src1[row][2] * src2[2][col] +
                             src1[row][3] * src2[3][col];
        }
    }
    

    This function mainly minimizes loop overhead, but the division and modulus might be taxing (a variant without them is sketched below). On my computer this function performed roughly 50% faster than a triple-for-loop multiplication function.
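
    If the division/modulus turns out to be a problem, one possible variant (my sketch, not benchmarked here; mat4_mul_nodiv is a hypothetical name) simply uses two explicit counters instead:

    /* same unrolled dot product, but row/col come from plain loops */
    inline void mat4_mul_nodiv(mat4 dest, mat4 src1, mat4 src2)
    {
        unsigned int row, col;
        for (row = 0; row < 4; row++)
            for (col = 0; col < 4; col++)
                dest[row][col] = src1[row][0] * src2[0][col] +
                                 src1[row][1] * src2[1][col] +
                                 src1[row][2] * src2[2][col] +
                                 src1[row][3] * src2[3][col];
    }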

    Cons:

    • Lots of code needed (ex. different functions for mat3 x mat3, mat5 x mat5...)

    • Tweaks needed for irregular multiplication (ex. mat3 x mat4).....
