FMA3 in GCC: how to enable

核能气质少年 提交于 2019-11-30 00:11:36

Only answering a very small part of the question here. If you write _mm256_add_ps(_mm256_mul_ps(areg0,breg0), tmp0), gcc-4.9 handles it almost like inline asm and does not optimize it much. If you replace it with areg0*breg0+tmp0, a syntax that is supported by both gcc and clang, then gcc starts optimizing and may use FMA if available. I improved that for gcc-5, _mm256_add_ps for instance is now implemented as an inline function that simply uses +, so the code with intrinsics can be optimized as well.

Z boson

The following compiler options are sufficient to contract _mm256_add_ps(_mm256_mul_ps(a, b), c) to a single fma instruction now (e.g vfmadd213ps):

GCC 5.3:   -O2 -mavx2 -mfma
Clang 3.7: -O1 -mavx2 -mfma -ffp-contract=fast
ICC 13:    -O1 -march=core-avx2

I tried /O2 /arch:AVX2 /fp:fast with MSVC but it still does not contract (surprise surprise). MSVC will contract scalar operations though.

GCC started doing this since at least GCC 5.1.


Although -O1 is sufficient for this optimization with some compilers, always use at least -O2 for overall performance, preferably -O3 -march=native -flto and also profile-guided optimization.

And if it's ok for your code, -ffast-math.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!