How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

扶醉桌前 submitted on 2019-11-26 21:56:21
Mysticial

The compiler is allowed to fuse a separate add and multiply, even though this changes the final result (by making it more accurate).

An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not with -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, Clang contracts only within a single expression like a+b*c, not across separate C++ statements.)
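As a rough illustration of the "single expression vs. separate statements" distinction, the sketch below puts the pragma at file scope and writes the same computation both ways. The function names are hypothetical, and whether either function actually becomes an FMA depends on the compiler and flags; as far as I'm aware GCC ignores this pragma and goes by -ffp-contract instead.

```c
/* Request contraction where the standard permits it. Some compilers
   ignore this pragma (possibly with a warning) and use their own flags. */
#pragma STDC FP_CONTRACT ON

/* Multiply and add in one expression: eligible for contraction
   under the pragma alone. */
float single_expression(float a, float b, float c) {
    return a * b + c;
}

/* Multiply and add in separate statements: the product is a named
   temporary, so the pragma alone does not permit fusing across them.
   A "fast" contraction mode may still fuse this. */
float separate_statements(float a, float b, float c) {
    float product = a * b;
    return product + c;
}
```

Comparing the generated assembly of the two functions (for example with -S) is the simplest way to see what a given compiler does with each form.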

This is different from strict vs. relaxed floating point (or in GCC terms, -fno-fast-math vs. -ffast-math), where the relaxed mode allows other kinds of optimizations that could increase the rounding error depending on the input values. Contraction is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.

Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.


So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:

FMA3 Intrinsics: (Intel Haswell, introduced alongside AVX2)

  • _mm_fmadd_pd(), _mm256_fmadd_pd()
  • _mm_fmadd_ps(), _mm256_fmadd_ps()
  • and about a gazillion other variations...

FMA4 Intrinsics: (AMD Bulldozer, introduced alongside XOP)

  • _mm_macc_pd(), _mm256_macc_pd()
  • _mm_macc_ps(), _mm256_macc_ps()
  • and about a gazillion other variations...

I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

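#include <immintrin.h>  // AVX intrinsics and the __m256 type

// Scalar: a separate multiply and add that the compiler may contract.
float mul_add(float a, float b, float c) {
    return a*b + c;
}

// Vector: explicit multiply and add intrinsics, not an FMA intrinsic.
__m256 mul_addv(__m256 a, __m256 b, __m256 c) {
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}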

With the right compiler options (see below), every compiler generates a vfmadd instruction (e.g. vfmadd213ss) from mul_add. MSVC is the only one that fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

The following compiler options are sufficient to generate vfmadd instructions (except for mul_addv with MSVC).

GCC:   -O2 -mavx2 -mfma
Clang: -O1 -mavx2 -mfma -ffp-contract=fast
ICC:   -O1 -march=core-avx2
MSVC:  /O1 /arch:AVX2 /fp:fast
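One quick sanity check that the GCC or Clang options above took effect is to test the __FMA__ macro those compilers predefine when FMA code generation is enabled. The function name below is hypothetical, and as far as I know MSVC does not define __FMA__ (it defines __AVX2__ under /arch:AVX2), so this sketch only reports a definite "yes" for GCC-style compilers.

```c
/* Returns 1 if this translation unit was compiled with FMA code
   generation enabled according to the __FMA__ macro (GCC, Clang),
   and 0 if the macro is absent (FMA disabled, or a compiler that
   does not define it). */
int built_with_fma(void) {
#if defined(__FMA__)
    return 1;
#else
    return 0;
#endif
}
```

This only tells you the instruction set is available to the compiler; whether a particular a*b + c was actually contracted still has to be checked in the generated assembly.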

GCC 4.9 will not contract mul_addv to a single FMA instruction, but GCC has done so since at least version 5.1. I don't know when the other compilers started doing this.
