I have a matrix A of size (m * l * 4), where m is around 100,000 and l = 100. The size of list is always equal to n, with n <= m. I want to sum the rows of A given by list into C[cluster].
Swap the loop over i and the loop over j in the second part, so the j loop is outermost. This makes the function more cache-friendly: for each j, the inner loops then stream through the contiguous slab A[list[j]] instead of jumping to a different slab on every iteration.
for (int j = 0; j < n; ++j)
{
    for (int i = 0; i < l; i++)
    {
        for (int k = 0; k < 4; k++)
            C[cluster][i][k] += A[list[j]][i][k];
    }
}
Also, I hope you did not forget -O3 flag.
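For example (the file name here is just a placeholder), a typical compile line would look like:

    g++ -O3 -march=native matrix_add.cpp -o matrix_add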
(update: an earlier version had the indexing wrong. This version auto-vectorizes fairly easily).
Use a C multidimensional array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so the memory is contiguous. This makes a huge difference to how effective hardware prefetching is, and thus to how well you use memory bandwidth. Loads whose address comes from another load really suck for performance; conversely, having predictable addresses known well ahead of time helps a lot, because the loads can start well before the results are needed (thanks to out-of-order execution).
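As a concrete illustration (the names and sizes here are made up, not from the question), compare an array-of-pointers layout with a flat contiguous buffer:

    #include <vector>

    void layout_demo(int rows, int cols, int i, int j)
    {
        // Non-contiguous: each row is a separate allocation, so reaching row i
        // requires loading a pointer first. The data load's address depends on
        // another load, which defeats hardware prefetching.
        int **A_ptrs = new int*[rows];
        for (int r = 0; r < rows; r++)
            A_ptrs[r] = new int[cols]();
        int a = A_ptrs[i][j];

        // Contiguous: one allocation; element (i, j) sits at the predictable
        // address base + i*cols + j, so prefetching (and vectorization) work well.
        std::vector<int> A_flat((size_t)rows * cols, 0);
        int b = A_flat[(size_t)i * cols + j];

        (void)a; (void)b;
        for (int r = 0; r < rows; r++)
            delete[] A_ptrs[r];
        delete[] A_ptrs;
    }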
Also, @user31264's answer is correct: you need to interchange the loops so the loop over j is the outer-most one. This is good, but nowhere near sufficient on its own.
This will also allow the compiler to auto-vectorize. Actually, I had a surprisingly hard time getting gcc to auto-vectorize well. (But that's probably because I got the indexing wrong the first time, since I only glanced at the code, so the compiler didn't know we were looping over contiguous memory.)
I played around with it on the Godbolt compiler explorer.
I finally got nice compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:
#include <vector>
using std::vector;

void MatrixAddition_contiguous(int rows, int n, const vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
   // still auto-vectorizes without __restrict__, but checks for overlap and
   // runs a scalar loop in that case
{
    const int cols = 4;   // or global constexpr or something
    int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

    for (int i = 0; i < rows; i++)
        //#pragma omp simd
        for (int k = 0; k < 4; k++)
            Ccluster[cols*i + k] = 0;

    for (int j = 0; j < n; ++j) {   // loop over clusters in A in the outer-most loop
        const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
        // #pragma omp simd   // Doesn't work: only auto-vectorizes with -O3
        // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
        for (int i = 0; i < rows; i++) {
            long row_offset = cols*i;
            //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
            for (int k = 0; k < 4; k++)
                Ccluster[row_offset + k] += Alistj[row_offset + k];
        }
    }
}
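For reference, a minimal driver might look like the following (the sizes, fill values, and cluster index are made up for illustration and assume this lives in the same file as the function above):

    int main()
    {
        const int rows = 100, cols = 4;      // l = 100, 4 ints per element
        const int m = 1000;                  // number of slabs in A (100,000 in the question)
        const int num_clusters = 10;

        vector<int> A((long)m * rows * cols, 1);               // flat A, filled with 1s
        vector<int> C((long)num_clusters * rows * cols, 0);    // flat C
        vector<int> list = {3, 17, 42};                        // n = 3 selected slabs

        MatrixAddition_contiguous(rows, (int)list.size(), list,
                                  A.data(), C.data(), /*cluster=*/5);
        return 0;
    }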
Manually hoisting list[j] definitely helped the compiler realize that stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other stuff probably wasn't necessary.

Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as far as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.
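To make the hoisting concrete, here's an illustrative sketch (not code from the answer, and assuming the same flat layout and names A, C, list, rows, cols, n, cluster as above):

    // Without hoisting: list[j] is (logically) reloaded inside the hot loop,
    // and the compiler has to prove the stores to C can't change list[j].
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++)
                C[(long)cluster*rows*cols + i*cols + k] += A[(long)list[j]*rows*cols + i*cols + k];

    // With hoisting: read list[j] (and form the base pointers) once per j.
    for (int j = 0; j < n; ++j) {
        const int *src = A + (long)list[j] * rows * cols;   // hoisted load of list[j]
        int *dst = C + (long)cluster * rows * cols;
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++)
                dst[i*cols + k] += src[i*cols + k];
    }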
The inner loop, with gcc 5.3 targeting x86-64 with -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm), is:
.L26:
vmovdqu ymm0, YMMWORD PTR [r8+rax] # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
vpaddd ymm0, ymm0, YMMWORD PTR [rdx+rax] # vect__71.75, MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
add r12d, 1 # ivtmp.88,
vmovdqu YMMWORD PTR [r8+rax], ymm0 # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
add rax, 32 # ivtmp.91,
cmp r12d, r9d # ivtmp.88, bnd.66
jb .L26 #,
So it's doing eight adds at once, with AVX2 vpaddd, with unaligned loads and an unaligned store back into C.
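A rough AVX2 intrinsics equivalent of that inner loop, just to show what the compiler generated (sketch only; the function name is made up, and the compiler also handles the tail when the count isn't a multiple of 8):

    #include <immintrin.h>

    static void add_row_avx2(int *__restrict__ dst, const int *__restrict__ src, long count)
    {
        for (long x = 0; x + 8 <= count; x += 8) {
            __m256i c = _mm256_loadu_si256((const __m256i*)(dst + x)); // unaligned load from C
            __m256i a = _mm256_loadu_si256((const __m256i*)(src + x)); // unaligned load from A
            c = _mm256_add_epi32(c, a);                                // vpaddd: 8 x 32-bit adds
            _mm256_storeu_si256((__m256i*)(dst + x), c);               // unaligned store back to C
        }
    }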
Since this is auto-vectorizing, it should produce good code for ARM NEON, or PPC Altivec, or anything else that can do packed 32-bit addition.
I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.
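For example (again with a placeholder file name), the invocations would look something like:

    g++     -O3 -march=haswell -ftree-vectorizer-verbose=2       -c matrix_add.cpp
    clang++ -O3 -march=haswell -Rpass-analysis=loop-vectorize    -c matrix_add.cpp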