I have a matrix A of size (m * l * 4), where m is around 100,000 and l = 100. The size of list is always equal to n, with n <= m. I want to sum the rows of A given by list into C[cluster].
Swap the loop over i and the loop over j in the second part, so the j loop is outermost. This makes the function more cache-friendly: for each j, the inner loops then stream through the contiguous slab A[list[j]] instead of jumping to a different slab on every iteration.
for (int j = 0; j < n; ++j)
{
    for (int i = 0; i < l; i++)
    {
        for (int k = 0; k < 4; k++)
            C[cluster][i][k] += A[list[j]][i][k];
    }
}
Also, I hope you did not forget -O3 flag.
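For example (the file name here is just a placeholder), a typical compile line would look like:

    g++ -O3 -march=native matrix_add.cpp -o matrix_add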
(update: an earlier version had the indexing wrong. This version auto-vectorizes fairly easily).
Use a C multidimensional array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so the memory is contiguous. This makes a huge difference to how effective hardware prefetching is, and thus to how well you use memory bandwidth. Loads whose address comes from another load really suck for performance; conversely, having predictable addresses known well ahead of time helps a lot, because the loads can start well before the results are needed (thanks to out-of-order execution).
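As a concrete illustration (the names and sizes here are made up, not from the question), compare an array-of-pointers layout with a flat contiguous buffer:

    #include <vector>

    void layout_demo(int rows, int cols, int i, int j)
    {
        // Non-contiguous: each row is a separate allocation, so reaching row i
        // requires loading a pointer first. The data load's address depends on
        // another load, which defeats hardware prefetching.
        int **A_ptrs = new int*[rows];
        for (int r = 0; r < rows; r++)
            A_ptrs[r] = new int[cols]();
        int a = A_ptrs[i][j];

        // Contiguous: one allocation; element (i, j) sits at the predictable
        // address base + i*cols + j, so prefetching (and vectorization) work well.
        std::vector<int> A_flat((size_t)rows * cols, 0);
        int b = A_flat[(size_t)i * cols + j];

        (void)a; (void)b;
        for (int r = 0; r < rows; r++)
            delete[] A_ptrs[r];
        delete[] A_ptrs;
    }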
Also, @user31264's answer is correct: you need to interchange the loops so the loop over j is the outer-most one. This is good, but nowhere near sufficient on its own.
This will also allow the compiler to auto-vectorize. Actually, I had a surprisingly hard time getting gcc to auto-vectorize well. (But that's probably because I got the indexing wrong the first time, since I only glanced at the code, so the compiler didn't know we were looping over contiguous memory.)
I played around with it on the Godbolt compiler explorer.
I finally got nice compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:
#include <vector>
using std::vector;

void MatrixAddition_contiguous(int rows, int n, const vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
   // still auto-vectorizes without __restrict__, but checks for overlap and
   // runs a scalar loop in that case
{
    const int cols = 4;   // or global constexpr or something
    int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

    for (int i = 0; i < rows; i++)
        //#pragma omp simd
        for (int k = 0; k < 4; k++)
            Ccluster[cols*i + k] = 0;

    for (int j = 0; j < n; ++j) {   // loop over clusters in A in the outer-most loop
        const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
        // #pragma omp simd   // Doesn't work: only auto-vectorizes with -O3
        // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
        for (int i = 0; i < rows; i++) {
            long row_offset = cols*i;
            //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
            for (int k = 0; k < 4; k++)
                Ccluster[row_offset + k] += Alistj[row_offset + k];
        }
    }
}
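For reference, a minimal driver might look like the following (the sizes, fill values, and cluster index are made up for illustration and assume this lives in the same file as the function above):

    int main()
    {
        const int rows = 100, cols = 4;      // l = 100, 4 ints per element
        const int m = 1000;                  // number of slabs in A (100,000 in the question)
        const int num_clusters = 10;

        vector<int> A((long)m * rows * cols, 1);               // flat A, filled with 1s
        vector<int> C((long)num_clusters * rows * cols, 0);    // flat C
        vector<int> list = {3, 17, 42};                        // n = 3 selected slabs

        MatrixAddition_contiguous(rows, (int)list.size(), list,
                                  A.data(), C.data(), /*cluster=*/5);
        return 0;
    }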
Manually hoisting list[j] definitely helped the compiler realize that stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other stuff probably wasn't necessary.

Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as far as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.
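To make the hoisting concrete, here's an illustrative sketch (not code from the answer, and assuming the same flat layout and names A, C, list, rows, cols, n, cluster as above):

    // Without hoisting: list[j] is (logically) reloaded inside the hot loop,
    // and the compiler has to prove the stores to C can't change list[j].
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++)
                C[(long)cluster*rows*cols + i*cols + k] += A[(long)list[j]*rows*cols + i*cols + k];

    // With hoisting: read list[j] (and form the base pointers) once per j.
    for (int j = 0; j < n; ++j) {
        const int *src = A + (long)list[j] * rows * cols;   // hoisted load of list[j]
        int *dst = C + (long)cluster * rows * cols;
        for (int i = 0; i < rows; i++)
            for (int k = 0; k < cols; k++)
                dst[i*cols + k] += src[i*cols + k];
    }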
The inner loop, with gcc 5.3 targeting x86-64 with -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm), is:
.L26:
vmovdqu ymm0, YMMWORD PTR [r8+rax] # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
vpaddd ymm0, ymm0, YMMWORD PTR [rdx+rax] # vect__71.75, MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
add r12d, 1 # ivtmp.88,
vmovdqu YMMWORD PTR [r8+rax], ymm0 # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
add rax, 32 # ivtmp.91,
cmp r12d, r9d # ivtmp.88, bnd.66
jb .L26 #,
So it's doing eight adds at once, with AVX2 vpaddd, with unaligned loads and an unaligned store back into C.
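A rough AVX2 intrinsics equivalent of that inner loop, just to show what the compiler generated (sketch only; the function name is made up, and the compiler also handles the tail when the count isn't a multiple of 8):

    #include <immintrin.h>

    static void add_row_avx2(int *__restrict__ dst, const int *__restrict__ src, long count)
    {
        for (long x = 0; x + 8 <= count; x += 8) {
            __m256i c = _mm256_loadu_si256((const __m256i*)(dst + x)); // unaligned load from C
            __m256i a = _mm256_loadu_si256((const __m256i*)(src + x)); // unaligned load from A
            c = _mm256_add_epi32(c, a);                                // vpaddd: 8 x 32-bit adds
            _mm256_storeu_si256((__m256i*)(dst + x), c);               // unaligned store back to C
        }
    }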
Since this is auto-vectorizing, it should produce good code for ARM NEON, or PPC Altivec, or anything else that can do packed 32-bit addition.
I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.
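For example (again with a placeholder file name), the invocations would look something like:

    g++     -O3 -march=haswell -ftree-vectorizer-verbose=2       -c matrix_add.cpp
    clang++ -O3 -march=haswell -Rpass-analysis=loop-vectorize    -c matrix_add.cpp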