Using AVX instructions disables exp() optimization?

Submitted by 陌路散爱 on 2019-12-03 14:42:26

If you use any 256-bit AVX instruction, the "AVX upper state" becomes "dirty", which causes a large stall if you subsequently execute legacy SSE instructions (including scalar floating-point arithmetic performed in the XMM registers). This is documented in the Intel Optimization Manual, which you can download for free (and which is a must-read if you're doing this sort of work):

AVX instructions always modify the upper bits of YMM registers, and SSE instructions do not modify the upper bits. From a hardware perspective, the upper bits of the YMM register collection can be considered to be in one of three states:

• Clean: All upper bits of YMM are zero. This is the state when processor starts from RESET.

• Modified and saved to XSAVE region: The content of the upper bits of YMM registers matches saved data in the XSAVE region. This happens after XSAVE/XRSTOR executes.

• Modified and Unsaved: The execution of one AVX instruction (either 256-bit or 128-bit) modifies the upper bits of the destination YMM.

The AVX/SSE transition penalty applies whenever the processor state is "Modified and Unsaved". Using VZEROUPPER moves the processor state to "Clean" and avoids the transition penalty.

Your routine B( ) dirties the YMM state, so the SSE code in A( ) stalls. Insert a VZEROUPPER instruction between B and A to avoid the problem.
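As a rough sketch of what that fix can look like from C, the intrinsic _mm256_zeroupper() emits VZEROUPPER. Everything named below is hypothetical and not taken from the question: b_sum_avx stands in for B, a_scalar stands in for the scalar math in A, and b_then_a is just a wrapper. The sketch assumes GCC or Clang (for the target attribute and __builtin_cpu_supports) and falls back to plain scalar code when AVX is not available.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_X86 1
#endif

#ifdef SKETCH_X86
/* Hypothetical stand-in for B(): sums an array with 256-bit AVX adds,
   which leaves the upper halves of the YMM registers dirty. */
__attribute__((target("avx")))
static float b_sum_avx(const float *x, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(x + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);

    float s = 0.0f;
    for (int k = 0; k < 8; k++)
        s += lanes[k];
    for (; i < n; i++)          /* leftover elements */
        s += x[i];

    /* Clear the upper YMM state before returning to code that may
       use legacy SSE encodings. This is the VZEROUPPER instruction. */
    _mm256_zeroupper();
    return s;
}
#endif

/* Hypothetical stand-in for the scalar work in A(); without AVX flags
   this is typically compiled to legacy SSE scalar instructions. */
static float a_scalar(float v)
{
    return v * 0.5f + 1.0f;
}

/* Runs the "B" step, then the "A" step. */
float b_then_a(const float *x, size_t n)
{
    float total = 0.0f;
#ifdef SKETCH_X86
    if (__builtin_cpu_supports("avx")) {
        total = b_sum_avx(x, n);
        return a_scalar(total);
    }
#endif
    for (size_t i = 0; i < n; i++)  /* portable fallback, no AVX */
        total += x[i];
    return a_scalar(total);
}
```

Compilers that build a whole file with AVX enabled often insert VZEROUPPER at function boundaries on their own, so the explicit call matters most with hand-written assembly, inline asm, or mixed compilation units. If B is written in assembly, the equivalent is a vzeroupper instruction just before it returns.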
