OpenMP parallelizing matrix multiplication by a triple for loop (performance issue)

情歌与酒 2021-02-06 08:45

I'm writing a matrix multiplication program with OpenMP that, for cache friendliness, implements the multiplication A x B(transpose) rows x rows instead of the classic A x

2 answers
  • 2021-02-06 09:06

    You probably have data dependencies that the compiler cannot rule out when you parallelize the outer loop, so it adds extra locks.

    Most likely it decides that different outer-loop iterations could write to the same element (C+(i*Nu+j)) and adds access locks to protect it.

    The compiler could probably figure out that there are no dependencies if you parallelize the second loop. Proving that there are no dependencies when parallelizing the outer loop is not so trivial for a compiler.

    UPDATE

    Some performance measurements.

    Hi again. It looks like 1000 double multiplications and additions are not enough to cover the cost of thread synchronization.

    I've done a few small tests, and simple vector-scalar multiplication is not effective with OpenMP unless the number of elements is greater than ~10'000. Basically, the larger your array is, the more performance you will get from using OpenMP.

    So if you parallelize the innermost loop, you have to split the work between threads and gather the data back 1'000'000 times.
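
    One way to act on the size threshold mentioned above is OpenMP's if clause, which runs the loop in parallel only when a condition holds. The sketch below is illustrative: scale_vector and PAR_THRESHOLD are hypothetical names, and the cutoff simply reuses the rough ~10'000 figure from the tests above, so it would need tuning on a real machine.

```c
#include <assert.h>

/* Hypothetical cutoff based on the rough ~10'000-element figure above;
   the real break-even point depends on the machine and the compiler. */
#define PAR_THRESHOLD 10000

/* Multiply every element of x by s. The if clause makes the loop run with
   a full team of threads only when n exceeds the cutoff; below it, the
   loop runs on a single thread and avoids most of the fork/join cost. */
void scale_vector(double *x, int n, double s)
{
    assert(n >= 0);
    #pragma omp parallel for if(n > PAR_THRESHOLD)
    for (int i = 0; i < n; i++) {
        x[i] *= s;
    }
}
```

    With GCC this would be built with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially.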

    PS. Try Intel ICC; it is more or less free to use for students and open-source projects. I remember using OpenMP with it on arrays smaller than 10'000 elements.

    UPDATE 2: Reduction example

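        double sum = 0.0;
        int k = 0;
        double *al = A + i*Nu;   // row i of A
        double *bl = B + j*Nu;   // row j of B (B is used transposed)
        // Each thread accumulates a private partial sum; OpenMP adds them up at the end.
        #pragma omp parallel for shared(al, bl) reduction(+:sum)
        for (k = 0; k < Nu; k++) {
            sum += al[k] * bl[k]; // C(i,j) = sum over k of A(i,k) * B(j,k)
        }
        C[i*Nu+j] = sum;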
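
    Wrapping the reduction snippet in the two outer loops gives a complete routine, sketched here under some assumptions: the function name matmul_bt_inner, the parameter name Bt (B stored transposed, row-major) and the returned counter are all illustrative and not part of the original code. The counter only makes visible that the parallel region is entered once per (i, j) pair, which is where the 1'000'000 figure for Nu = 1000 comes from.

```c
#include <assert.h>

/* C = A * B for n x n row-major matrices, where Bt holds B transposed
   (row j of Bt is column j of B). Returns how many times the parallel
   region was entered. Names are illustrative assumptions. */
long matmul_bt_inner(const double *A, const double *Bt, double *C, int n)
{
    assert(n > 0);
    long regions = 0;
    for (int i = 0; i < n; i++) {
        const double *al = A + i * n;          /* row i of A */
        for (int j = 0; j < n; j++) {
            const double *bl = Bt + j * n;     /* row j of Bt */
            double sum = 0.0;
            /* One fork/join per (i, j) pair: n * n in total. */
            #pragma omp parallel for reduction(+:sum)
            for (int k = 0; k < n; k++) {
                sum += al[k] * bl[k];
            }
            C[i * n + j] = sum;
            regions++;
        }
    }
    return regions;
}
```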
    
  • 2021-02-06 09:15

    Try writing to the result array less often. Updating C inside the innermost loop induces cache-line sharing between cores and prevents the operation from running in parallel. Accumulating into a local variable instead will allow most of the writes to take place in each core's L1 cache.

    Also, using restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.

    Try:

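        for (int i = 0; i < Nu; i++) {
            const double* const Arow = A + i*Nu;
            double* const Crow = C + i*Nu;
            // j, k and sum are declared inside the parallel loop, so each thread gets its own copy
            #pragma omp parallel for
            for (int j = 0; j < Nu; j++) {
                const double* const Bcol = B + j*Nu; // row j of the stored B
                double sum = 0.0;
                for (int k = 0; k < Nu; k++) {
                    sum += Arow[k] * Bcol[k]; // C(i,j) = sum over k of A(i,k) * B(j,k)
                }
                Crow[j] = sum;
            }
        }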
    

    Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.
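
    To make the restrict suggestion concrete, here is a minimal sketch that assumes A, B and C never overlap. The function name matmul_bt_outer and the parameter name Bt (B stored transposed) are hypothetical. Note that it parallelizes the outer i loop rather than the j loop used in the snippet above, so the thread team is created once instead of Nu times; whether that is faster on a given machine would need measuring.

```c
#include <assert.h>

/* C = A * B for n x n row-major matrices, where Bt holds B transposed.
   restrict (C99) promises the compiler that A, Bt and C do not overlap.
   Names are illustrative assumptions. */
void matmul_bt_outer(const double *restrict A, const double *restrict Bt,
                     double *restrict C, int n)
{
    assert(n > 0);
    /* Each thread handles whole rows of C: one fork/join in total, and
       no two threads write to the same row. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        const double *Arow = A + i * n;
        double *Crow = C + i * n;
        for (int j = 0; j < n; j++) {
            const double *Brow = Bt + j * n;
            double sum = 0.0;              /* local accumulator */
            for (int k = 0; k < n; k++) {
                sum += Arow[k] * Brow[k];
            }
            Crow[j] = sum;                 /* one write per element of C */
        }
    }
}
```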
