Question
One of the purposes of the C1 bit in the x87 FPU status word is to show whether or not an inexact result was rounded up.
Does SSE/AVX provide any such indication for scalar operations?
I did not see a similar bit in the MXCSR register. Am I forced to use x87 instructions if I want this information?
Answer 1:
SSE/AVX do not provide hardware support for detecting this, even for scalar instructions like addss. SSE was designed for SIMD, with 4 floats per XMM vector, and presumably Intel didn't want to provide a bitmap of 4 bits in the MXCSR, although that would have been a possible design choice.
As @Mysticial points out in comments, it can be possible to calculate it using extra instructions.
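One way to "calculate it using extra instructions" for addition is the classic error-free TwoSum transformation, which recovers the exact rounding error of a + b using only ordinary adds and subtracts. The following is a minimal sketch and not necessarily what the comment had in mind; the function name add_rounding_direction is hypothetical, and it assumes round-to-nearest, finite inputs, no overflow, and strict IEEE double evaluation.

```c
/* Hypothetical helper, not from the original answer.
 * Returns +1 if the computed a + b is above the exact sum (rounded up),
 *         -1 if it is below (rounded down), 0 if the sum was exact.
 * Assumes round-to-nearest, finite inputs, no overflow, and strict
 * IEEE double evaluation (no -ffast-math, no x87 extended precision). */
int add_rounding_direction(double a, double b)
{
    double s   = a + b;               /* the rounded result */
    double bv  = s - a;               /* portion of b that made it into s */
    double av  = s - bv;              /* portion of a that made it into s */
    double err = (a - av) + (b - bv); /* exactly (a + b) - s */

    if (err > 0.0) return -1;         /* exact sum is larger: s was rounded down */
    if (err < 0.0) return +1;         /* exact sum is smaller: s was rounded up */
    return 0;
}
```

For example, add_rounding_direction(1.0, 0x1p-54) reports -1, because 1 + 2^-54 rounds down to 1.0. The same sequence vectorizes directly with addps/subps, at the cost of several extra instructions per vector.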
(Untested idea that might do what you want. I think this should work even with subnormals and so on; compare for exact equality is the same as bitwise compare except for -0.0 == +0.0, or for NaN)
With AVX512, you might do your add/sub/mul/div/sqrt calculation normally (with default rounding), then again with a rounding-mode override to truncation toward 0. Use vcmpps for equality on the results. The elements that compare exactly equal were rounded toward 0 by the default rounding mode (or were exact both times). Of course you could use toward -Inf or toward +Inf as your override to detect that instead of toward 0.
AVX512's EVEX prefix can encode a rounding-mode override on a per-instruction basis, without changing MXCSR. This makes the approach significantly more efficient than switching MXCSR back and forth, e.g. _mm512_add_round_ps (__m512 a, __m512 b, int). Note that AVX512 embedded rounding (er) is only available for 512-bit vectors and scalar instructions; you unfortunately can't use it with AVX512VL to do rounding overrides on 128-bit or 256-bit vectors to avoid the max-turbo and other downsides of using 512-bit vectors on current Skylake-family CPUs. Using ER also implies SAE (suppress-all-exceptions), meaning the instruction doesn't have to update MXCSR at all. See also "AVX-512 Instruction Encoding - {er} Meaning".
In asm syntax, rz = round toward zero. See Table 2-36, "EVEX Embedded Broadcast/Rounding/SAE and Vector Length on Vector Instructions", in Intel's vol.2 x86 manual.
vaddpd zmm2, zmm1, zmm0 ; no override, or {rne-sae} would be Nearest-Even
vaddpd zmm3, zmm1, zmm0, {rz-sae} ; rounding = truncation toward Zero
vcmpneqpd k1, zmm2, zmm3 ; compare for not-equal
;;; k1 = bitmask
;; 0 means rounded toward 0 or exact
;; 1 means rounded away from 0
If you don't need the primary result to be a 512-bit vector, you can do that and the compare with XMM or YMM registers, but the {rz-sae} operation has to be ZMM. A YMM compare gives you the option of comparing into another YMM register (AVX1) instead of into an AVX512 mask register. But if you're using AVX512, mask registers are usually pretty nice.
This always needs 2 extra instructions: repeating the operation and a compare. Mysticial's suggestion to use an FMA after mulps might avoid the compare, if you just use the sign bit of the FMA result directly instead of comparing against zero, e.g. with vmovmskps to get an integer bitmap, or vxorps or vandps to combine some vectors where the "truth value" you care about is the sign bit. This might be an input for vblendvps (which also only looks at sign bits), or for an eventual vmovmskps.
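To illustrate the residual idea for a multiply: with p = a*b already rounded, fma(a, b, -p) yields the signed rounding error, and its sign tells you which way p was rounded. The scalar sketch below is only an illustration under assumptions: it emulates that residual in double precision (which is exact for float inputs, since two 24-bit significands multiply into at most 48 bits) so that it doesn't depend on FMA hardware. The function name mul_rounding_direction is hypothetical, and finite inputs with no float overflow are assumed.

```c
/* Hypothetical helper, not from the original answer.
 * Returns +1 if the float product a * b was rounded up,
 *         -1 if it was rounded down, 0 if it was exact.
 * Assumes finite inputs and no overflow in the float product. */
int mul_rounding_direction(float a, float b)
{
    float  p        = a * b;                 /* rounded single-precision product */
    double exact    = (double)a * (double)b; /* exact: 24 + 24 bits fit in 53 */
    double residual = exact - (double)p;     /* stands in for fma(a, b, -p) */

    if (residual > 0.0) return -1;           /* exact product is larger: rounded down */
    if (residual < 0.0) return +1;           /* exact product is smaller: rounded up */
    return 0;
}
```

In a vectorized version you would keep the residual vector itself and feed its sign bits to vmovmskps or vblendvps, rather than branching on it as this scalar sketch does.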
Changing the rounding mode without AVX512 might not be a total disaster, especially if you can do a few vectors with default before changing to truncation and redoing them. That might make it more efficient than a rounding-direction-detection sequence that took 3 or more instructions per vector if you have enough registers to play with to amortize the MXCSR changes over enough operations.
Apparently some Intel CPUs do rename MXCSR; a perf event for MXCSR rename stall cycles exists on some microarchitectures (not sure which):
Stalls due to the MXCSR register rename occurring too close to a previous MXCSR rename.
So changing it wouldn't have to drain the scheduler, but it's not great. And according to that wording, changing it twice in quick succession could be bad. IDK if there's maybe just a limited number of physical MXCSR entries to rename onto, or some other reason for that limitation.
Of course in a loop you wouldn't store, bit-flip, and reload MXCSR values; you have two MXCSR values in memory and just ldmxcsr them.
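A rough C sketch of that non-AVX512 approach, assuming an x86 target with SSE and using the _mm_getcsr / _mm_setcsr intrinsics from xmmintrin.h: the two MXCSR values are prepared once, a whole batch is computed under the default mode, and then the batch is redone under truncation and compared. The function name add_detect_away_from_zero is hypothetical, NaN inputs are not handled, and the volatile temporaries are only there to discourage the compiler from folding the two additions together or moving them across the MXCSR writes.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_ROUND_MASK, _MM_ROUND_TOWARD_ZERO */

/* Hypothetical helper, not from the original answer.
 * sum[i]  = a[i] + b[i] under the caller's rounding mode.
 * away[i] = 1 if that result differs from the truncated one,
 *           i.e. it was rounded away from zero; 0 otherwise.
 * Assumes no NaNs (a NaN would compare unequal and set away[i]). */
void add_detect_away_from_zero(const float *a, const float *b,
                               float *sum, unsigned char *away, int n)
{
    /* Prepare both MXCSR values once, outside the loops. */
    const unsigned int csr_default = _mm_getcsr();
    const unsigned int csr_trunc =
        (csr_default & ~(unsigned int)_MM_ROUND_MASK) | _MM_ROUND_TOWARD_ZERO;

    /* Pass 1: the whole batch under the caller's rounding mode. */
    for (int i = 0; i < n; i++) {
        volatile float x = a[i], y = b[i];
        volatile float s = x + y;
        sum[i] = s;
    }

    /* Pass 2: redo the batch with truncation and compare. */
    _mm_setcsr(csr_trunc);
    for (int i = 0; i < n; i++) {
        volatile float x = a[i], y = b[i];
        volatile float t = x + y;
        away[i] = (unsigned char)(t != sum[i]);
    }

    /* Restore the saved value; this also drops any exception
     * flags raised by the additions above. */
    _mm_setcsr(csr_default);
}
```

The two MXCSR writes are amortized over n additions, which is the point of batching; a real implementation would use packed adds and compares in the loops instead of scalar code.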
Source: https://stackoverflow.com/questions/58524438/does-sse-avx-provide-a-means-of-determining-if-a-result-was-rounded-up