Does SSE/AVX provide a means of determining if a result was rounded up?

Submitted by 放荡痞女 on 2020-12-09 12:20:55

Question


One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up.

Does SSE/AVX provide any such indication for scalar operations?

I did not see a similar bit in the MXCSR register. Am I forced to use x87 instructions if I want this information?


Answer 1:


SSE/AVX do not provide hardware support for detecting this, even for scalar instructions like addss. SSE was designed for SIMD, with 4 floats per XMM vector, and presumably Intel didn't want to provide a 4-bit bitmap in MXCSR, although that would have been a possible design choice.

As @Mysticial points out in comments, it can be possible to calculate it using extra instructions.
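
One well-known way to do this for addition, not necessarily the one the comment had in mind, is Knuth's TwoSum: a few extra adds and subtracts recover the exact rounding error, and its sign gives the direction. The sketch below is an illustration only; the helper name add_rounding_direction is made up for this example, and it assumes round-to-nearest, finite inputs, no overflow, and strict IEEE double arithmetic (no -ffast-math, no x87 excess precision).

```c
/* Hypothetical helper (not from the original answer).
 * Classifies how s = a + b was rounded under round-to-nearest:
 *   +1  s is above the exact sum (rounded up)
 *   -1  s is below the exact sum (rounded down)
 *    0  the addition was exact
 * Assumes finite inputs, no overflow, strict IEEE double arithmetic. */
int add_rounding_direction(double a, double b)
{
    double s   = a + b;                 /* the rounded sum */
    double bv  = s - a;                 /* part of b that made it into s */
    double av  = s - bv;                /* part of a that made it into s */
    double err = (a - av) + (b - bv);   /* exactly (a + b) - s */

    if (err < 0.0) return +1;           /* exact sum < s: rounded up */
    if (err > 0.0) return -1;           /* exact sum > s: rounded down */
    return 0;
}
```

For example, add_rounding_direction(1.0, 0x1.8p-53) adds three quarters of an ulp to 1.0, so the sum rounds up to the next double and the helper returns +1. This costs 5 extra floating-point operations per add, which is why the approaches below may be more attractive when they are available.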


(Untested idea that might do what you want. I think this should work even with subnormals and so on; comparing for exact equality is the same as a bitwise compare, except that -0.0 == +0.0 and that NaN never compares equal.)

With AVX512, you might do your add/sub/mul/div/sqrt calculation normally (with default rounding), then again with a rounding-mode override to truncation toward 0. Use vcmpps for equality on the results. The elements that compare exactly equal were rounded toward 0 by the default rounding mode (or were exact both times). Of course you could use toward -Inf or toward +Inf as your override to detect that direction instead of toward 0.

AVX512's EVEX prefix can encode a rounding-mode override on a per-instruction basis, without touching MXCSR, which is significantly more efficient than changing MXCSR. The intrinsics expose this as e.g. _mm512_add_round_ps(__m512 a, __m512 b, int rounding). Note that for packed instructions, AVX512 embedded rounding (er) is only available with 512-bit vectors (scalar instructions such as vaddss accept it too); you unfortunately can't use it with AVX512VL to do rounding overrides on 256-bit vectors to avoid the reduced max turbo and other downsides of using 512-bit vectors on current Skylake-family CPUs. Using ER also implies SAE (suppress-all-exceptions), meaning the instruction doesn't have to update MXCSR at all. See also: "AVX-512 Instruction Encoding - {er} Meaning".

In asm syntax, rz = round toward zero. See Table 2-36. EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions in Intel's vol.2 x86 manual.

    vaddpd     zmm2, zmm1, zmm0            ; no override, or {rne-sae} would be Nearest-Even
    vaddpd     zmm3, zmm1, zmm0, {rz-sae}  ; rounding = truncation toward Zero
    vcmpneqpd  k1, zmm2, zmm3              ; compare for not-equal
    ;; k1 = bitmask, one bit per element:
    ;;   0 means rounded toward 0 or exact
    ;;   1 means rounded away from 0
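
The same sequence can be sketched with C intrinsics. This is an untested-on-your-hardware illustration rather than a definitive implementation: the function names rounded_away_mask_avx512 and rounded_away_mask are invented for the example, the target attribute and __builtin_cpu_supports are GCC/Clang extensions, and the code assumes MXCSR is in its default round-to-nearest mode.

```c
#include <immintrin.h>

/* Hypothetical helper mirroring the asm above.
 * Bit i of the result is 1 if a[i] + b[i] rounded with the current MXCSR
 * mode (assumed round-to-nearest-even) differs from the same sum
 * truncated toward zero, i.e. default rounding moved it away from zero.
 * The target attribute lets this compile without -mavx512f; the caller
 * must still check for CPU support before calling it. */
__attribute__((target("avx512f")))
static unsigned rounded_away_mask_avx512(const double a[8], const double b[8])
{
    __m512d va = _mm512_loadu_pd(a);
    __m512d vb = _mm512_loadu_pd(b);

    __m512d nearest = _mm512_add_pd(va, vb);              /* MXCSR rounding */
    __m512d trunc   = _mm512_add_round_pd(va, vb,         /* {rz-sae} */
                          _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);

    /* Not-equal compare into a mask register, like vcmpneqpd k1, ... */
    __mmask8 k = _mm512_cmp_pd_mask(nearest, trunc, _CMP_NEQ_UQ);
    return (unsigned)k;
}

/* Returns the 8-bit mask, or -1 if the CPU lacks AVX-512F. */
int rounded_away_mask(const double a[8], const double b[8])
{
    if (!__builtin_cpu_supports("avx512f"))
        return -1;
    return (int)rounded_away_mask_avx512(a, b);
}
```

With a = {1.0, 1.0, 1.0, -1.0, 0, ...} and b = {0x1.8p-53, 0x1p-54, 2.0, -0x1.8p-53, 0, ...}, elements 0 and 3 round away from zero under round-to-nearest, so on a CPU with AVX-512F the mask should have bits 0 and 3 set.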

If you don't need the primary result to be a 512-bit vector, you can do that and the compare with XMM or YMM registers, but the {rz-sae} operation has to be ZMM. YMM compare gives you the option of comparing into another YMM register (AVX1) instead of into an AVX512 mask register. But if you're using AVX512, mask registers are usually pretty nice.

This always needs 2 extra instructions: repeating the operation and a compare. Mysticial's suggestion to use an FMA after mulps might avoid the compare: with p = a*b already computed, fma(a, b, -p) presumably gives the exact rounding error (barring overflow or underflow), so you can use its sign bit directly instead of comparing against zero. e.g. vmovmskps to get an integer bitmap, or vxorps or vandps to combine some vectors where the "truth value" you care about is the sign bit. This might be an input for vblendvps (which also only looks at sign bits), or for an eventual vmovmskps.


Changing the rounding mode without AVX512 might not be a total disaster, especially if you can do a few vectors with the default mode before changing to truncation and redoing them. If you have enough registers to amortize the MXCSR changes over enough operations, that could be more efficient than a rounding-direction-detection sequence that takes 3 or more instructions per vector.

Apparently some Intel CPUs do rename MXCSR; a perf event for MXCSR rename stall cycles exists on some microarchitectures (not sure which):

"Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename."

So changing it wouldn't have to drain the scheduler, but it's not great. And according to that wording, changing it twice in quick succession could be bad. IDK if there's maybe just a limited number of physical MXCSR entries to rename onto, or some other reason for that limitation.

Of course in a loop you wouldn't store, bit-flip, and reload MXCSR values; you have two MXCSR values in memory and just ldmxcsr them.
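
A minimal sketch of that MXCSR-switching approach with SSE intrinsics, for a single vector. The function name rounded_away_mask_sse is invented for this example. Compilers do not track MXCSR as a dependency of floating-point operations, so the volatile temporaries here are a pragmatic workaround to keep the two adds separate and in place, not a guarantee; check the generated asm. A real loop would amortize the two ldmxcsr instructions over many vectors as described above.

```c
#include <xmmintrin.h>

/* Hypothetical helper: adds a + b once with the caller's rounding mode
 * and once truncating toward zero, switching between two precomputed
 * MXCSR values (each _mm_setcsr compiles to one ldmxcsr).
 * Returns a 4-bit mask; bit i set means element i differs between the
 * two results, i.e. the caller's mode did not round it toward zero. */
int rounded_away_mask_sse(__m128 a, __m128 b)
{
    const unsigned csr_default = _mm_getcsr();
    const unsigned csr_trunc   = (csr_default & ~_MM_ROUND_MASK)
                               | _MM_ROUND_TOWARD_ZERO;

    /* volatile: discourage merging the two identical adds or moving
     * them across the MXCSR writes. */
    volatile __m128 va = a, vb = b, r_default, r_trunc;

    r_default = _mm_add_ps(va, vb);     /* caller's rounding mode */
    _mm_setcsr(csr_trunc);
    r_trunc = _mm_add_ps(va, vb);       /* truncation toward zero */
    _mm_setcsr(csr_default);            /* also restores the old sticky flags */

    return _mm_movemask_ps(_mm_cmpneq_ps(r_default, r_trunc));
}
```

Note that restoring the saved MXCSR value also discards any exception flags raised by the two adds, which may or may not be what you want.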



Source: https://stackoverflow.com/questions/58524438/does-sse-avx-provide-a-means-of-determining-if-a-result-was-rounded-up
