Question
I was trying to use AVX512 intrinsics to vectorize my tiled matrix-multiplication loop. I used __m256d variables to hold intermediate results and then stored them into my result matrix. However, this somehow triggers memory corruption. I have no hint why, as the non-AVX version works fine. Another weird thing is that the tile size now somehow affects the result.
The matrix structs are attached in the following code section. The function takes two matrix pointers, m1 and m2, and an integer tileSize. Thanks to @harold's feedback, I've now replaced the _mm256_load_pd for matrix m1 with a broadcast. However, the memory corruption problem still persists. I've also attached the memory-corruption output below.
__m256d rResult, rm1, rm2, rmult;
for (int bi = 0; bi < result->row; bi += tileSize) {
    for (int bj = 0; bj < result->col; bj += tileSize) {
        for (int bk = 0; bk < m1->col; bk += tileSize) {
            for (int i = 0; i < tileSize; i++) {
                for (int j = 0; j < tileSize; j += 4) {
                    rResult = _mm256_setzero_pd();
                    for (int k = 0; k < tileSize; k++) {
                        // result->val[bi+i][bj+j] += m1->val[bi+i][bk+k] * m2->val[bk+k][bj+j];
                        rm1 = _mm256_broadcast_pd((__m128d const *) &m1->val[bi+i][bk+k]);
                        rm2 = _mm256_load_pd(&m2->val[bk+k][bj+j]);
                        rmult = _mm256_mul_pd(rm1, rm2);
                        rResult = _mm256_add_pd(rResult, rmult);
                        _mm256_store_pd(&result->val[bi+i][bj+j], rResult);
                    }
                }
            }
        }
    }
}
return result;
*** Error in `./matrix': free(): invalid next size (fast): 0x0000000001880910 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81609)[0x2b04a26d0609]
./matrix[0x4016cc]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b04a2671495]
./matrix[0x400e29]
======= Memory map: ========
00400000-0040c000 r-xp 00000000 00:2c 6981358608 /home/matrix
0060b000-0060c000 r--p 0000b000 00:2c 6981358608 /home/matrix
0060c000-0060d000 rw-p 0000c000 00:2c 6981358608 /home/matrix
01880000-018a1000 rw-p 00000000 00:00 0 [heap]
2b04a1f13000-2b04a1f35000 r-xp 00000000 00:16 12900 /usr/lib64/ld-2.17.so
2b04a1f35000-2b04a1f3a000 rw-p 00000000 00:00 0
2b04a1f4e000-2b04a1f52000 rw-p 00000000 00:00 0
2b04a2134000-2b04a2135000 r--p 00021000 00:16 12900 /usr/lib64/ld-2.17.so
2b04a2135000-2b04a2136000 rw-p 00022000 00:16 12900 /usr/lib64/ld-2.17.so
2b04a2136000-2b04a2137000 rw-p 00000000 00:00 0
2b04a2137000-2b04a2238000 r-xp 00000000 00:16 13188 /usr/lib64/libm-2.17.so
2b04a2238000-2b04a2437000 ---p 00101000 00:16 13188 /usr/lib64/libm-2.17.so
2b04a2437000-2b04a2438000 r--p 00100000 00:16 13188 /usr/lib64/libm-2.17.so
2b04a2438000-2b04a2439000 rw-p 00101000 00:16 13188 /usr/lib64/libm-2.17.so
2b04a2439000-2b04a244e000 r-xp 00000000 00:16 12867 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a244e000-2b04a264d000 ---p 00015000 00:16 12867 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264d000-2b04a264e000 r--p 00014000 00:16 12867 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264e000-2b04a264f000 rw-p 00015000 00:16 12867 /usr/lib64/libgcc_s-4.8.5-20150702.so.1
2b04a264f000-2b04a2811000 r-xp 00000000 00:16 13172 /usr/lib64/libc-2.17.so
2b04a2811000-2b04a2a11000 ---p 001c2000 00:16 13172 /usr/lib64/libc-2.17.so
2b04a2a11000-2b04a2a15000 r--p 001c2000 00:16 13172 /usr/lib64/libc-2.17.so
2b04a2a15000-2b04a2a17000 rw-p 001c6000 00:16 13172 /usr/lib64/libc-2.17.so
2b04a2a17000-2b04a2a1c000 rw-p 00000000 00:00 0
2b04a2a1c000-2b04a2a1e000 r-xp 00000000 00:16 13184 /usr/lib64/libdl-2.17.so
2b04a2a1e000-2b04a2c1e000 ---p 00002000 00:16 13184 /usr/lib64/libdl-2.17.so
2b04a2c1e000-2b04a2c1f000 r--p 00002000 00:16 13184 /usr/lib64/libdl-2.17.so
2b04a2c1f000-2b04a2c20000 rw-p 00003000 00:16 13184 /usr/lib64/libdl-2.17.so
2b04a4000000-2b04a4021000 rw-p 00000000 00:00 0
2b04a4021000-2b04a8000000 ---p 00000000 00:00 0
7ffc8448e000-7ffc844b1000 rw-p 00000000 00:00 0 [stack]
7ffc845ed000-7ffc845ef000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Aborted
Answer 1:
That code loads a small row vector from m1 and a small row vector from m2 and multiplies them, which is not how matrix multiplication works; I assume it's a direct vectorization of the equivalent scalar loop. You can use a broadcast-load from m1; that way the product with the row vector from m2 yields a row vector of the result, which is convenient. (The other way around, broadcasting from m2, you get a column vector of the result, which is tricky to store, unless of course you use a column-major matrix layout.)
Never resetting rResult is also wrong, and takes extra care when using tiling, because tiling means that individual results are put aside and then picked up again later. It's convenient to implement C += A*B, because then you don't have to distinguish between the second time a result is worked on (loading rResult back out of the result matrix) and the first time a result is worked on (either zeroing the accumulator, or, if you implement C += A*B, also just loading it out of the result).
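A minimal sketch of the inner loops fixed along those lines, dropped into the question's bi/bj/bk loop nest; it assumes the question's row-major layout (val[i][j] rows of doubles), rows 32-byte aligned, dimensions that are multiples of tileSize, and result zeroed before the bk loop starts. It broadcasts a single element of m1 with _mm256_broadcast_sd (note that the question's _mm256_broadcast_pd duplicates a pair of doubles, which is not an element-wise broadcast), accumulates C += A*B, and stores only after the k loop:

for (int i = 0; i < tileSize; i++) {
    for (int j = 0; j < tileSize; j += 4) {
        // Pick up the partial sum left by earlier bk tiles (C += A*B).
        __m256d acc = _mm256_load_pd(&result->val[bi+i][bj+j]);
        for (int k = 0; k < tileSize; k++) {
            // One element of m1 in all four lanes...
            __m256d a = _mm256_broadcast_sd(&m1->val[bi+i][bk+k]);
            // ...times four consecutive elements of one row of m2.
            __m256d b = _mm256_load_pd(&m2->val[bk+k][bj+j]);
            acc = _mm256_add_pd(acc, _mm256_mul_pd(a, b));
        }
        // One store per result vector, after the k loop finishes.
        _mm256_store_pd(&result->val[bi+i][bj+j], acc);
    }
}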
There are also some performance bugs:
- Using only one accumulator. This limits the inner loop to one iteration every 4 cycles (Skylake) in the long run, because of the loop-carried dependency through the addition (or FMA). The hardware can do 2 FMAs per cycle, but this way there would be one FMA every 4 cycles: 1/8th speed.
- Using a 2:1 load-to-FMA ratio (assuming the mul+add is contracted). It needs to be 1:1 or better to avoid being bottlenecked by load throughput; a 2:1 ratio is limited to half speed.
The solution for both is multiplying a small column vector from m1 with a small row vector from m2 in the inner loop, summing into a small matrix of accumulators rather than just one of them. For example, with a 3x16 region (3x4 vectors, with a vector length of 4; the vectors correspond to loads from m2, while from m1 you would do broadcast-loads), there are 12 accumulators and therefore 12 independent dependency chains: enough to hide the high latency-throughput product of FMA (2 per cycle but 4 cycles long on Skylake, so you need at least 8 independent chains, and at least 10 on Haswell). It also means there are 7 loads and 12 FMAs in the inner loop, even better than 1:1; it can even support turbo frequencies without overclocking the cache.
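For illustration, a sketch of such a micro-kernel; the name, signature, and stride parameters are hypothetical, not from the answer. It holds a 3-row x 16-column block of C in 12 accumulators, doing 4 row-vector loads from b plus 3 broadcast-loads from a (7 loads) against 12 FMAs per k iteration. Compile with FMA enabled (e.g. -mfma):

#include <immintrin.h>

// Hypothetical 3x16 micro-kernel: C[3][16] += A[3][K] * B[K][16].
// a, b, c are row-major with strides lda, ldb, ldc (in doubles).
static void kernel_3x16(const double *a, const double *b, double *c,
                        int K, int lda, int ldb, int ldc)
{
    // 12 independent accumulators = 12 independent dependency chains.
    __m256d acc[3][4];
    for (int r = 0; r < 3; r++)
        for (int v = 0; v < 4; v++)
            acc[r][v] = _mm256_loadu_pd(&c[r * ldc + 4 * v]);

    for (int k = 0; k < K; k++) {
        // 4 row-vector loads from b...
        __m256d b0 = _mm256_loadu_pd(&b[k * ldb + 0]);
        __m256d b1 = _mm256_loadu_pd(&b[k * ldb + 4]);
        __m256d b2 = _mm256_loadu_pd(&b[k * ldb + 8]);
        __m256d b3 = _mm256_loadu_pd(&b[k * ldb + 12]);
        for (int r = 0; r < 3; r++) {
            // ...plus 3 broadcast-loads from a: 7 loads vs 12 FMAs.
            __m256d ar = _mm256_broadcast_sd(&a[r * lda + k]);
            acc[r][0] = _mm256_fmadd_pd(ar, b0, acc[r][0]);
            acc[r][1] = _mm256_fmadd_pd(ar, b1, acc[r][1]);
            acc[r][2] = _mm256_fmadd_pd(ar, b2, acc[r][2]);
            acc[r][3] = _mm256_fmadd_pd(ar, b3, acc[r][3]);
        }
    }

    for (int r = 0; r < 3; r++)
        for (int v = 0; v < 4; v++)
            _mm256_storeu_pd(&c[r * ldc + 4 * v], acc[r][v]);
}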
I would also like to note that using the same tile size in every dimension is not necessarily best. Maybe it is, but probably not; the dimensions do act a little differently.
A more advanced performance issue:
- Not re-packing tiles. This means tiles will span more pages than necessary, which hurts the effectiveness of the TLB. You can easily get into a situation where your tiles fit in the cache, but are spread over too many pages to fit in the TLB. TLB thrashing is not good.
Using asymmetric tile sizes you can arrange for either m1 tiles or m2 tiles to be TLB-friendly, but not both at the same time.
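A minimal re-packing sketch, using a hypothetical pack_tile helper: each tile is copied once into a small contiguous buffer, so all subsequent passes over it touch only a few pages:

#include <string.h>

// Hypothetical helper: copy a rows x cols tile starting at src[r0][c0]
// into one contiguous buffer. dst must hold rows * cols doubles,
// ideally 32-byte aligned (e.g. from aligned_alloc(32, ...)).
static void pack_tile(double *dst, double *const *src,
                      int r0, int c0, int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        memcpy(&dst[r * cols], &src[r0 + r][c0], cols * sizeof(double));
}

// e.g. pack m2's current tile once per bk/bj step:
//   pack_tile(packedB, m2->val, bk, bj, tileSize, tileSize);
// then the inner loop reads packedB[k * tileSize + j] with unit stride.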
Answer 2:
If you care about performance, normally you want one contiguous chunk of memory, not an array of pointers to rows.
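For example, a contiguous layout might look like this (the question's actual struct isn't shown, so this is an assumption):

// One contiguous allocation; element (i, j) lives at val[i * col + j].
typedef struct {
    int row, col;
    double *val;   // row * col doubles, ideally 32-byte aligned
} Matrix;

// m->val[i * m->col + j] replaces m->val[i][j]: elements of a row are
// adjacent, rows are adjacent to each other, and the whole matrix is a
// single allocation instead of one per row plus a pointer array.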
Anyway, you're probably reading off the end of a row if your tile size isn't a multiple of the vector width of 4 doubles. And if your rows or cols aren't a multiple of the tile size, you need to stop after the last full tile and write cleanup code for the end, e.g. bi < result->row - (tileSize-1) for the outer loops.
If your tile size isn't a multiple of 4, you'd also need j < tileSize-3 for the vectorized loop (it's the j loop that steps by 4). But hopefully you are doing power-of-2 loop tiling / cache blocking. You'd still want a size - 3 boundary for vector cleanup in a partial tile, then probably scalar cleanup for the last few elements. (Or, if you can use an unaligned final vector that ends at the end of a row, that can work, maybe with masked loads/stores, but that's trickier for matmul than for algorithms that just make a single pass.)
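A sketch of that full-vectors-then-scalar-cleanup structure for one row pass, with illustrative names (axpy_row is not from the answer):

#include <immintrin.h>

// One row pass of C += a * B: full 4-wide vectors first, then a
// scalar remainder of 0..3 elements at the end of the row.
static void axpy_row(double a, const double *brow, double *crow, int cols)
{
    __m256d va = _mm256_broadcast_sd(&a);
    int j = 0;
    for (; j + 4 <= cols; j += 4) {   // same bound as j < cols - 3
        __m256d acc = _mm256_loadu_pd(&crow[j]);
        acc = _mm256_fmadd_pd(va, _mm256_loadu_pd(&brow[j]), acc);
        _mm256_storeu_pd(&crow[j], acc);
    }
    for (; j < cols; j++)             // scalar cleanup
        crow[j] += a * brow[j];
}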
Source: https://stackoverflow.com/questions/58160897/avx-intrinsics-for-tiled-matrix-multiplication