How to write C++ code that the compiler can efficiently compile to SSE or AVX?


Question


Let's say I have a function written in C++ that performs matrix-vector multiplications on a lot of vectors. It takes a pointer to the array of vectors to transform. Am I correct to assume that the compiler cannot efficiently optimize that to SIMD instructions because it does not know the alignment of the passed pointer (requiring 16-byte alignment for SSE or 32-byte alignment for AVX) at compile time? Or is the memory alignment of the data irrelevant for optimal SIMD code, with data alignment only affecting cache performance?
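For concreteness, a function of the kind described might look like the sketch below; the layout (4-component float vectors, a row-major 4x4 matrix) and the names are illustrative assumptions, not taken from the question. The compiler only sees a plain float*, so it knows nothing about its alignment.

// Hypothetical sketch of the function described in the question: transform
// `count` 4-component float vectors in place with a row-major 4x4 matrix.
void transform_vectors(const float mat[16], float* vecs, int count)
{
    for (int v = 0; v < count; ++v) {
        float x = vecs[4*v + 0], y = vecs[4*v + 1],
              z = vecs[4*v + 2], w = vecs[4*v + 3];
        for (int r = 0; r < 4; ++r) {
            vecs[4*v + r] = mat[4*r + 0]*x + mat[4*r + 1]*y +
                            mat[4*r + 2]*z + mat[4*r + 3]*w;
        }
    }
}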

If alignment is important for the generated code, how can I let the (Visual C++) compiler know that I intend to only pass values with a certain alignment to the function?


Answer 1:


In theory, alignment should not matter on Intel processors since Nehalem. Therefore, your compiler should be able to produce code in which it does not matter whether a pointer is aligned or not.

Unaligned load/store instructions have had the same performance on Intel processors since Nehalem. However, until AVX arrived with Sandy Bridge, unaligned loads could not be folded with another operation for micro-op fusion.

Additionally, even before AVX, 16-byte-aligned memory could still be helpful to avoid the penalty of cache-line splits, so it would still be reasonable for a compiler to add code until the pointer is 16-byte aligned.

Since AVX there is no advantage to using aligned load/store instructions anymore, so there is no reason for a compiler to add code to make a pointer 16-byte or 32-byte aligned.

However, there is still a reason to use aligned memory with AVX: to avoid cache-line splits. Therefore, it would be reasonable for a compiler to add code to make the pointer 32-byte aligned even if it still used an unaligned load instruction.
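As a side note (my own sketch, not part of the original answer), 32-byte-aligned storage can be obtained with alignas for static or automatic data and with _mm_malloc/_mm_free for heap data; these work with GCC, Clang, ICC, and MSVC:

#include <immintrin.h>   // _mm_malloc / _mm_free

void aligned_buffers_example()
{
    // Static or automatic storage: alignas gives the alignment directly.
    alignas(32) float stack_buf[256];
    stack_buf[0] = 0.0f;   // keep the sketch warning-free

    // Heap storage: _mm_malloc/_mm_free are available on all four compilers.
    // (C++17 std::aligned_alloc or MSVC's _aligned_malloc are alternatives.)
    float* heap_buf = static_cast<float*>(_mm_malloc(256 * sizeof(float), 32));

    // ... fill and use heap_buf with 32-byte-aligned loads/stores ...

    _mm_free(heap_buf);
}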

So in practice some compilers produce much simpler code when they are told to assume that a pointer is aligned.

I'm not aware of a method to tell MSVC that a pointer is aligned. With GCC and Clang (since 3.6) you can use the built-in __builtin_assume_aligned. With ICC and also GCC you can use #pragma omp simd aligned. With ICC you can also use __assume_aligned.
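Sketched below, before the concrete GCC experiment, is how those hints look in source; this is illustrative and compiler-specific (the #pragma omp simd variant would additionally need an OpenMP SIMD flag such as GCC's -fopenmp-simd):

// Illustrative sketch of the alignment hints mentioned above.
void scale(float* __restrict a, float* __restrict b, int n)
{
#if defined(__INTEL_COMPILER)
    __assume_aligned(a, 16);                       // ICC
    __assume_aligned(b, 16);
#elif defined(__GNUC__)                            // GCC and Clang (>= 3.6)
    a = (float*)__builtin_assume_aligned(a, 16);
    b = (float*)__builtin_assume_aligned(b, 16);
#endif
    // With ICC or GCC one can instead annotate the loop itself:
    // #pragma omp simd aligned(a, b : 16)
    for (int i = 0; i < n; i++)
        b[i] = 3.14159f * a[i];
}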

For example, compiling this simple loop with GCC

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 16);
    //b = (float*)__builtin_assume_aligned (b, 16);
    for(int i=0; i<(n & (-4)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

with gcc -O3 -march=nehalem -S test.c and then running wc test.s gives 160 lines. Whereas if __builtin_assume_aligned is used, wc test.s gives only 45 lines. When I did this with Clang, both cases returned 110 lines.

So with Clang, informing the compiler that the arrays were aligned made no difference (in this case), but with GCC it did. Counting lines of code is not a sufficient metric to gauge performance, but I'm not going to post all the assembly here; I just want to illustrate that your compiler may produce very different code when it is told the arrays are aligned.

Of course, the additional overhead that GCC has for not assuming the arrays are aligned may make no difference in practice. You have to test and see.


In any case, if you want to get the most from SIMD, I would not rely on the compiler to do it correctly (especially with MSVC). Your example of matrix*vector is a poor one in general (though maybe not for some special cases) since it's memory-bandwidth bound. But if you choose matrix*matrix, no compiler is going to optimize that well without a lot of help that does not conform to the C++ standard. In these cases you will need intrinsics/built-ins/assembly, in which you have explicit control of the alignment anyway.
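For illustration only (my sketch, not the answer's code): the same b[i] = 3.14159f*a[i] loop written with SSE intrinsics, assuming both pointers are 16-byte aligned and n is a multiple of 4. With intrinsics the aligned/unaligned choice is explicit: _mm_load_ps versus _mm_loadu_ps.

#include <immintrin.h>

// Minimal intrinsics sketch; assumes 16-byte aligned pointers and n % 4 == 0
// (otherwise a scalar cleanup loop would be needed).
void foo_sse(const float* a, float* b, int n)
{
    const __m128 k = _mm_set1_ps(3.14159f);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(a + i);           // aligned load (movaps)
        _mm_store_ps(b + i, _mm_mul_ps(v, k));   // aligned store
    }
}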


Edit:

The assembly from GCC contains many extraneous lines that are not part of the text segment. Compiling with gcc -O3 -march=nehalem, then disassembling with objdump -d and counting the lines in the text (code) segment, gives 108 lines without __builtin_assume_aligned and only 16 lines with it. This shows more clearly that GCC produces very different code when it assumes the arrays are aligned.


Edit:

I went ahead and tested the foo function above with MSVC 2013. It produces unaligned loads, and the code is much shorter than GCC's (I only show the main loop here):

$LL3@foo:
    movsxd  rax, r9d
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rax*4]
    vmovups XMMWORD PTR [r11+rax*4], xmm1
    lea eax, DWORD PTR [r9+4]
    add r9d, 8
    movsxd  rcx, eax
    vmulps  xmm1, xmm0, XMMWORD PTR [r10+rcx*4]
    vmovups XMMWORD PTR [r11+rcx*4], xmm1
    cmp r9d, edx
    jl  SHORT $LL3@foo

This should be fine on processors since Nehalem (late 2008). But MSVC still emits cleanup code for arrays whose length is not a multiple of four, even though I told the compiler that it was a multiple of four ((n & (-4))). At least GCC gets that right.


Since AVX can fold unaligned loads, I checked GCC with AVX to see if the code was the same.

void foo(float * __restrict a, float * __restrict b, int n)
{
    //a = (float*)__builtin_assume_aligned (a, 32);
    //b = (float*)__builtin_assume_aligned (b, 32);
    for(int i=0; i<(n & (-8)); i++) {
        b[i] = 3.14159f*a[i];
    }
}

Without __builtin_assume_aligned, GCC produces 168 lines of assembly; with it, GCC produces only 17 lines.




Answer 2:


My original answer became too messy to edit so I am adding a new answer here and making my original answer community wiki.

I did some tests using aligned and unaligned memory on a pre-Nehalem system and on a Haswell system with GCC, Clang, and MSVC.
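A test of this kind can be set up roughly as follows; this is my own sketch under stated assumptions (deliberate 4-byte misalignment, _mm_malloc for aligned allocation, timing code omitted), not the code used for the measurements:

#include <immintrin.h>   // _mm_malloc / _mm_free

// `foo` is the auto-vectorized loop from the first answer.
void foo(float* __restrict a, float* __restrict b, int n);

void run_alignment_test(int n)
{
    // One extra element so the misaligned view stays inside the allocation.
    float* a = static_cast<float*>(_mm_malloc((n + 1) * sizeof(float), 64));
    float* b = static_cast<float*>(_mm_malloc((n + 1) * sizeof(float), 64));
    for (int i = 0; i <= n; i++) { a[i] = 1.0f; b[i] = 0.0f; }

    foo(a, b, n);           // both pointers 64-byte aligned
    foo(a + 1, b + 1, n);   // both pointers misaligned by 4 bytes

    _mm_free(a);
    _mm_free(b);
}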

The assembly shows that only GCC adds code to check and fix alignment. Because of this, with __builtin_assume_aligned GCC produces much simpler code. But using __builtin_assume_aligned with Clang only changes unaligned instructions to aligned ones (the number of instructions stays the same). MSVC just uses unaligned instructions.

The performance result is that on pre-Nehalem systems, Clang and MSVC are much slower than GCC with auto-vectorization when the memory is not aligned.

But the penalty for cache-line splits has been small since Nehalem. It turns out that the extra code GCC adds to check and fix alignment costs more than the small cache-line-split penalty it avoids. This explains why neither Clang nor MSVC worries about cache-line splits when vectorizing.

So my original claim that auto-vectorization does not need to know about the alignment is more or less correct since Nehalem. That is not the same as saying that aligning memory is not useful since Nehalem.



Source: https://stackoverflow.com/questions/33504003/how-to-write-c-code-that-the-compiler-can-efficiently-compile-to-sse-or-avx
