I'm writing a program for matrix multiplication with OpenMP that, for cache convenience, implements the multiplication A x B(transpose) rows x rows instead of the classic A x
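Try hitting the result less often. Writing to the result matrix on every inner-loop iteration induces cache-line sharing between cores and keeps the operation from running efficiently in parallel. Accumulating into a local variable instead, and storing it to the result once per element, will allow most of the writes to take place in each core's L1 cache.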
Also, use of restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
Try:
for (i=0; i
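Here is a minimal sketch of the two suggestions above (a local accumulator plus restrict). It is not the loop from the original post. It assumes square n x n matrices of double stored row-major, and that B has already been transposed into Bt. The name matmul_bt and its signature are hypothetical.

```c
/* Hypothetical helper: computes C = A * B, given Bt = transpose(B).
 * Assumes n x n row-major matrices that do not overlap in memory. */
void matmul_bt(int n,
               const double *restrict A,
               const double *restrict Bt,
               double *restrict C)
{
    /* Parallelize over rows of C: each thread writes its own rows. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            /* Local accumulator: declared inside the parallel loop, so it
             * is private to each thread and can stay in a register. */
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];  /* row of A dot row of Bt */
            C[i * n + j] = sum;  /* a single store per result element */
        }
    }
}
```

Compile with the compiler's OpenMP flag (for example -fopenmp with GCC). Without it the pragma is ignored and the function runs serially with the same result.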
Also, I think Elalfer is right about needing reduction if you parallelize the innermost loop.
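For illustration, here is a sketch of what the reduction would look like if the innermost (dot-product) loop were the parallel one. Without the reduction clause, all threads would update the same sum concurrently, which is a data race. The helper name row_dot is hypothetical.

```c
/* Hypothetical helper: dot product of two length-n rows, with the
 * loop itself split across threads. */
double row_dot(int n, const double *a, const double *b)
{
    double sum = 0.0;
    /* reduction(+:sum) gives each thread a private partial sum and
     * adds the partial sums together when the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
```

Parallelizing the outer loop usually needs no reduction, since each thread then computes whole result elements on its own.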