Question
For those who have already measured this or have deep knowledge about these kinds of considerations, assume that you have to implement the following floating-point operation (just picking one for the example):
float calc(float y, float z)
{ return sqrt(y * y + z * z) / 100; }
Where y and z could be denormal numbers. Let's assume two possible situations where just y, just z, or maybe both, in a totally random manner, can be denormal:
- 50% of the time
- <1% of the time
And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0, so I change that piece of code to:
float calc(float y, float z)
{
    bool yzero = y < 1e-37;
    bool zzero = z < 1e-37;
    bool all_zero = yzero and zzero;
    bool some_zero = yzero != zzero;

    if (all_zero)
        return 0.0f;

    float ret;
    if (!some_zero) ret = sqrt(y * y + z * z);
    else if (yzero) ret = z;
    else if (zzero) ret = y;
    return ret / 100;
}
What will be worse: the performance penalty for branch misprediction (for the 50% or <1% cases), or the performance penalty for working with denormal numbers?
To properly interpret which operations can be normal or denormal in the previous piece of code, I would also like some one-line (but totally optional) answers to the following closely related questions:
float x = 0f; // Will x be just 0 or maybe some number like 1e-40?
float y = 0.; // I assume the conversion is just thin-air here and the compiler will see just a 0.
0; // Is "exact zero" a normal or a denormal number?
float z = x / 1; // Will this "no-op" (x == 0) cause z to be something like 1e-40 and thus denormal?
float zz = x / c; // What about a "no-op" operating against any compile-time constant?
bool yzero = y < 1e-37; // Do comparisons have any performance penalty when y is denormal, or don't they?
Answer 1:
There's HW support for this for free in many ISAs including x86; see below re: FTZ / DAZ. Most compilers set those flags during startup when you compile with -ffast-math or equivalent.
Also note that your code fails to avoid the penalty (on HW where there is any) in some cases: y * y or z * z can be subnormal for small but normalized y or z. (Good catch, @chtz.) The exponent of y*y is twice the exponent of y, more negative or more positive. With 23 explicit mantissa bits in a float, that's about 12 exponent values that are the square roots of subnormal values and wouldn't underflow all the way to 0.
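For concreteness, here's a tiny standalone sketch (my own illustration, not from the original post) of a normal input whose square is subnormal:

#include <cmath>
#include <cstdio>

// y is well above the questioner's 1e-37 cutoff, so it would not be flushed,
// yet y*y falls below FLT_MIN (~1.18e-38) and is therefore a subnormal result.
int main()
{
    float y  = 1e-20f;   // normal float
    float yy = y * y;    // 1e-40: subnormal
    std::printf("y  : %s\n", std::fpclassify(y)  == FP_SUBNORMAL ? "subnormal" : "normal");
    std::printf("y*y: %s\n", std::fpclassify(yy) == FP_SUBNORMAL ? "subnormal" : "normal");
}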
Squaring a subnormal always gives underflow to 0; subnormal input may be less likely to have a penalty than subnormal output for a multiply, I don't know. Having a subnormal penalty or not can vary by operation within one microarchitecture, like add/sub vs. multiply vs. divide.
Also, any negative y or z gets treated as 0, which is probably a bug unless your inputs are known non-negative.
if results can vary so widely, x86 microarchitectures will be my main use case
Yes, penalties (or lack thereof) vary greatly.
Historically (P6-family), Intel used to always take a very slow microcode assist for subnormal results and subnormal inputs, including for compares. Modern Intel CPUs (Sandybridge-family) handle some but not all FP operations on subnormal operands without needing a microcode assist (perf event fp_assists.any).
The microcode assist is like an exception and flushes the out-of-order pipeline, and takes over 160 cycles on SnB-family, vs. ~10 to 20 for a branch miss. And branch misses have "fast recovery" on modern CPUs. True branch-miss penalty depends on surrounding code; e.g. if the branch condition is really late to be ready it can result in discarding a lot of later independent work. But a microcode assist is still probably worse if you expect it to happen frequently.
Note that you can check for a subnormal using integer ops: just check the exponent field for all-zero (and the mantissa for non-zero: the all-zero encoding for 0.0 is technically a special case of a subnormal). So you could manually flush to zero with integer SIMD operations like andps/pcmpeqd/andps, as sketched below.
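A rough sketch of that idea using SSE2 intrinsics (my own code, not from the answer; the andps/pcmpeqd/andps sequence would do the same thing in assembly):

#include <immintrin.h>

// Zero out any lane whose biased exponent field is all-zero (i.e. zero or
// subnormal). Note: flushed lanes (and -0.0) come out as +0.0 here, whereas
// real DAZ/FTZ hardware preserves the sign of the flushed value.
__m128 flush_subnormals_to_zero(__m128 v)
{
    const __m128i exp_mask = _mm_set1_epi32(0x7f800000);         // exponent bits of binary32
    __m128i bits    = _mm_castps_si128(v);
    __m128i exp     = _mm_and_si128(bits, exp_mask);
    __m128i is_tiny = _mm_cmpeq_epi32(exp, _mm_setzero_si128()); // exponent == 0 ?
    return _mm_andnot_ps(_mm_castsi128_ps(is_tiny), v);          // keep only normal/Inf/NaN lanes
}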
Agner Fog's microarch PDF has some info; he mentions this in general without a fully detailed breakdown for each uarch. I don't think https://uops.info/ tests for normal vs. subnormal unfortunately.
Knight's Landing (KNL) only has subnormal penalties for division, not add / mul. Like GPUs, they took an approach that favoured throughput over latency and have enough pipeline stages in their FPU to handle subnormals in the hardware equivalent of branchlessly. Even though this might mean higher latency for every FP operation.
AMD Bulldozer / Piledriver have a ~175 cycle penalty for results that are "subnormal or underflow", unless FTZ is set. Agner doesn't mention subnormal inputs. Steamroller/Excavator don't have any penalties.
AMD Ryzen (from Agner Fog's microarch pdf)
Floating point operations that give a subnormal result take a few clock cycles extra. The same is the case when a multiplication or division underflows to zero. This is far less than the high penalty on the Bulldozer and Piledriver. There is no penalty when flush-to-zero mode and denormals-are-zero mode are both on.
By contrast, Intel Sandybridge-family (at least Skylake) doesn't have penalties for results that underflow all the way to 0.0.
Intel Silvermont (Atom) from Agner Fog's microarch pdf
Operations that have subnormal numbers as input or output or generate underflow take approximately 160 clock cycles unless the flush-to-zero mode and denormals-are-zero mode are both used.
This would include compares.
I don't know the details for any non-x86 microarchitectures, like ARM Cortex-A76 or any RISC-V core, to pick a couple of random examples that might also be relevant. Mispredict penalties vary wildly as well, across simple in-order pipelines vs. deep OoO exec CPUs like modern x86. True mispredict penalty also depends on surrounding code.
And now assume I want to avoid the performance penalty of dealing with denormal numbers and I just want to treat them as 0
Then you should set your FPU to do that for you for free, removing all possibility of penalties from subnormals.
Some / most(?) modern FPUs (including x86 SSE but not legacy x87) let you treat subnormals (aka denormals) as zero for free, so this problem only occurs if you want this behaviour for some functions but not all within the same thread, with switching so fine-grained that it isn't worth changing the FP control register to FTZ and back.
Or could be relevant if you wanted to write fully portable code that was terrible nowhere, even if it meant ignoring HW support and thus being slower than it could be.
Some x86 CPUs do even rename MXCSR so changing the rounding mode or FTZ/DAZ might not have to drain the out-of-order back-end. It's still not cheap and you'd want to avoid doing it every few FP instructions.
ARM also supports a similar feature: subnormal IEEE 754 floating point numbers support on iOS ARM devices (iPhone 4) - but apparently the default setting for ARM VFP / NEON is to treat subnormals as zero, favouring performance over strict IEEE compliance.
See also flush-to-zero behavior in floating-point arithmetic about cross-platform availability of this.
On x86 the specific mechanism is that you set the DAZ and FTZ bits in the MXCSR register (SSE FP math control register; also has bits for FP rounding mode, FP exception masks, and sticky FP masked-exception status bits). https://software.intel.com/en-us/articles/x87-and-sse-floating-point-assists-in-ia-32-flush-to-zero-ftz-and-denormals-are-zero-daz shows the layout and also discusses some performance effects on older Intel CPUs. Lots of good background / introduction.
Compiling with -ffast-math will link in some extra startup code that sets FTZ/DAZ before calling main. IIRC, threads inherit the MXCSR settings from the main thread on most OSes.
- DAZ = Denormals Are Zero: treats input subnormals as zero. This affects compares (whether or not they would have experienced a slowdown), making it impossible to even tell the difference between 0 and a subnormal other than using integer stuff on the bit-pattern.
- FTZ = Flush To Zero: subnormal outputs from calculations are just underflowed to zero, i.e. gradual underflow is disabled. (Note that multiplying two small normal numbers can underflow. I think add/sub of normal numbers whose mantissas cancel out except for the low few bits could produce a subnormal as well.)
Usually you simply set both or neither. If you're processing input data from another thread or process, or compile-time constants, you could still have subnormal inputs even if all results you produce are normalized or 0.
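For reference, a minimal sketch of setting both bits by hand (this assumes SSE and the standard intrinsic headers; it mirrors what the -ffast-math startup code does, it is not that code):

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Set FTZ and DAZ in MXCSR for the calling thread only.
void enable_ftz_daz()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // FTZ: subnormal results become 0
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // DAZ: subnormal inputs treated as 0
}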
Specific random questions:
float x = 0f; // Will x be just 0 or maybe some number like 1e-40?
This is a syntax error. Presumably you mean 0.f or 0.0f. 0.0f is exactly representable (with the bit-pattern 0x00000000) as an IEEE binary32 float, so that's definitely what you will get on any platform that uses IEEE FP. You won't randomly get subnormals that you didn't write.
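If you want to convince yourself, a quick (purely illustrative) check of the bit pattern:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float x = 0.0f;
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);      // inspect the IEEE binary32 encoding
    std::printf("0x%08x\n", (unsigned)bits);  // prints 0x00000000
}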
float z = x / 1; // Will this "no-op" (x == 0) cause z to be something like 1e-40 and thus denormal?
No, IEEE754 doesn't allow 0.0 / 1.0 to give anything other than 0.0.
Again, subnormals don't appear out of thin air. Rounding "error" only happens when the exact result can't be represented as a float or double. The max allowed error for the IEEE "basic" operations (* / + - and sqrt) is 0.5 ulp, i.e. the exact result must be correctly rounded to the nearest representable FP value, right down to the last digit of the mantissa.
bool yzero = y < 1e-37; // Do comparisons have any performance penalty when y is denormal, or don't they?
Maybe, maybe not. No penalty on recent AMD or Intel, but it is slow on Core 2, for example.
Note that 1e-37 has type double and will cause promotion of y to double. You might hope that this would actually avoid subnormal penalties vs. using 1e-37f. Subnormal float->int has no penalty on Core 2, but unfortunately cvtss2sd does still have the large penalty on Core 2. (GCC/clang don't optimize away the conversion even with -ffast-math, although I think they could, because 1e-37 is exactly representable as a float, and every subnormal float can be exactly represented as a normalized double. So the promotion to double is always exact and can't change the result.)
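A small illustration of that last point (my own sketch): even the smallest positive subnormal float converts to a perfectly normal double:

#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    float tiny = std::numeric_limits<float>::denorm_min();  // 2^-149, subnormal as a float
    double d = tiny;                                         // exact conversion, normal as a double
    std::printf("float : %s\n", std::fpclassify(tiny) == FP_SUBNORMAL ? "subnormal" : "normal");
    std::printf("double: %s\n", std::fpclassify(d)    == FP_SUBNORMAL ? "subnormal" : "normal");
}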
On Intel Skylake, comparing two subnormals with vcmplt_oqpd doesn't result in any slowdown, and neither does ucomisd into integer FLAGS. But on Core 2, both are slow.
Comparison, if done like subtraction, does have to shift the inputs to line up their binary place-values, and the implied leading digit of the mantissa is a 0 instead of 1, so subnormals are a special case. So hardware might choose not to handle that on the fast path and instead take a microcode assist. Older x86 hardware might handle this more slowly.
It could be done differently if you built a special compare ALU separate from the normal add/sub unit. Float bit-patterns can be compared as sign/magnitude integers (with a special case for NaN) because the IEEE exponent bias is chosen to make that work (i.e. nextafter is just integer ++ or -- on the bit pattern). But this apparently isn't what hardware does.
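A sketch of that ordering property (my own code, restricted to non-negative, non-NaN inputs, which is where the bit patterns order like unsigned integers):

#include <cstdint>
#include <cstring>

// For non-negative, non-NaN floats (zeros, subnormals, normals, +Inf), the raw
// IEEE bit patterns compare in the same order as the values themselves.
bool less_than_nonneg(float a, float b)
{
    std::uint32_t ua, ub;
    std::memcpy(&ua, &a, sizeof ua);
    std::memcpy(&ub, &b, sizeof ub);
    return ua < ub;
}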
FP conversion to integer is fast even on Core 2, though. cvt[t]ps2dq or the pd equivalent converts packed float/double to int32 with truncation or the current rounding mode. So, for example, this recent proposed LLVM optimization is safe on Skylake and Core 2, according to my testing.
Also on Skylake, squaring a subnormal (producing a 0) has no penalty. But it does have a huge penalty on Conroe (P6-family).
But multiplying normal numbers to produce a subnormal result has a penalty even on Skylake (~150x slower).
Source: https://stackoverflow.com/questions/60969892/performance-penalty-denormalized-numbers-versus-branch-mis-predictions