How to use Fused Multiply-Add (FMA) instructions with SSE/AVX

梦如初夏 2020-11-27 14:21

I have learned that some Intel/AMD CPUs can do simultaneous multiply and add with SSE/AVX:
FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2.

I like to

2 Answers
  • 2020-11-27 15:13

    The compiler is allowed to fuse a separated add and multiply, even though this changes the final result (by making it more accurate).

    An FMA has only one rounding (it effectively keeps infinite precision for the internal temporary multiply result), while an ADD + MUL has two.

    The IEEE and C standards allow this when #pragma STDC FP_CONTRACT ON is in effect, and compilers are allowed to have it ON by default (but not all do). GCC contracts into FMA by default (with the default -std=gnu*, but not -std=c*, e.g. -std=c++14). For Clang, it's only enabled with -ffp-contract=fast. (With just the #pragma enabled, contraction happens only within a single expression like a+b*c, not across separate C++ statements.)
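
    As a rough sketch of the single-expression rule, the two functions below compute the same thing but differ in what the pragma alone permits. The function names are hypothetical. Note the assumption that the compiler honors the pragma at all: GCC is known to ignore it (possibly with a warning) and relies on -ffp-contract instead.

```c
/* Ask for contraction where the C standard allows it.
   Assumption: the compiler honors this pragma; GCC ignores it. */
#pragma STDC FP_CONTRACT ON

/* Multiply and add in a single expression: the compiler may
   contract this into one FMA when the target supports it. */
double mul_add_expr(double a, double b, double c) {
    return a * b + c;
}

/* Multiply and add in separate statements: with only the pragma,
   the product is rounded before the add. Fusing across statements
   needs something stronger, such as -ffp-contract=fast. */
double mul_add_stmts(double a, double b, double c) {
    double product = a * b;
    return product + c;
}
```

    For inputs where every intermediate value is exactly representable, both functions return the same result whether or not contraction happens. They can differ in the last bit only when the product needs rounding.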

    This is different from strict vs. relaxed floating point (or in GCC terms, -ffast-math vs. -fno-fast-math), which would allow other kinds of optimizations that could increase the rounding error depending on the input values. This one is special because of the infinite precision of the FMA internal temporary; if there were any rounding at all in the internal temporary, this wouldn't be allowed in strict FP.

    Even if you enable relaxed floating-point, the compiler might still choose not to fuse since it might expect you to know what you're doing if you're already using intrinsics.


    So the best way to make sure you actually get the FMA instructions you want is to use the provided intrinsics for them:

    FMA3 Intrinsics: (Intel Haswell, introduced alongside AVX2)

    • _mm_fmadd_pd(), _mm256_fmadd_pd()
    • _mm_fmadd_ps(), _mm256_fmadd_ps()
    • and about a gazillion other variations...

    FMA4 Intrinsics: (AMD Bulldozer, introduced alongside XOP)

    • _mm_macc_pd(), _mm256_macc_pd()
    • _mm_macc_ps(), _mm256_macc_ps()
    • and about a gazillion other variations...
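
    As an illustration of the FMA3 form, here is a minimal sketch that uses _mm256_fmadd_ps() on eight floats. The helper names fmadd8 and fmadd8_available are hypothetical. The target attribute and __builtin_cpu_supports() are GCC/Clang extensions, used here as an assumption so the file compiles without -mavx2 -mfma; with those flags (or on MSVC with /arch:AVX2) the attribute would not be needed.

```c
#include <immintrin.h>

/* Returns nonzero if the running CPU reports AVX2 and FMA3 support.
   __builtin_cpu_supports is a GCC/Clang builtin for x86 targets. */
static int fmadd8_available(void) {
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
}

/* out[i] = a[i] * b[i] + c[i] for i in 0..7, one rounding per element.
   The target attribute enables AVX2/FMA code generation for this
   function only, so callers must check fmadd8_available() first. */
__attribute__((target("avx2,fma")))
static void fmadd8(const float *a, const float *b, const float *c, float *out) {
    __m256 va = _mm256_loadu_ps(a);   /* unaligned loads of 8 floats each */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_loadu_ps(c);
    _mm256_storeu_ps(out, _mm256_fmadd_ps(va, vb, vc));
}
```

    Passing pointers rather than __m256 values keeps the function callable from code that was compiled without AVX enabled.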
  • 2020-11-27 15:22

    I tested the following code in GCC 5.3, Clang 3.7, ICC 13.0.1 and MSVC 2015 (compiler version 19.00).

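    #include <immintrin.h>  // __m256, _mm256_add_ps, _mm256_mul_ps

    // Scalar multiply-add written as a single expression.
    float mul_add(float a, float b, float c) {
        return a*b + c;
    }

    // Vector multiply-add written with separate mul and add intrinsics.
    __m256 mul_addv(__m256 a, __m256 b, __m256 c) {
        return _mm256_add_ps(_mm256_mul_ps(a, b), c);
    }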
    

    With the right compiler options (see below), every compiler generates a vfmadd instruction (e.g. vfmadd213ss) from mul_add. However, MSVC is the only one that fails to contract mul_addv to a single vfmadd instruction (e.g. vfmadd213ps).

    The following compiler options are sufficient to generate vfmadd instructions (except with mul_addv with MSVC).

    GCC:   -O2 -mavx2 -mfma
    Clang: -O1 -mavx2 -mfma -ffp-contract=fast
    ICC:   -O1 -march=core-avx2
    MSVC:  /O1 /arch:AVX2 /fp:fast
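
    Before inspecting the generated assembly, a quick sanity check is to look at the predefined macros the options set. The sketch below is an assumption-laden illustration: the function name is hypothetical, __FMA__ is the macro GCC and Clang define under -mfma, and on MSVC __AVX2__ (set by /arch:AVX2) is used as a stand-in since no separate FMA macro is assumed to exist there.

```c
/* Returns 1 if this translation unit was compiled with options that
   let the compiler emit FMA instructions, 0 otherwise. This says
   nothing about whether the CPU running the program supports them. */
static int fma_codegen_enabled(void) {
#if defined(__FMA__)
    return 1;  /* GCC/Clang with -mfma or a -march that implies it */
#elif defined(_MSC_VER) && defined(__AVX2__)
    return 1;  /* MSVC with /arch:AVX2 (assumed to imply FMA codegen) */
#else
    return 0;
#endif
}
```

    A result of 1 only means contraction is possible; whether a given a*b + c actually becomes a vfmadd still depends on the optimization level and the contraction setting.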
    

    GCC 4.9 will not contract mul_addv to a single FMA instruction, but GCC has done so since at least version 5.1. I don't know when the other compilers started doing this.
