Assume that t, a, b are all double (IEEE Std 754) variables, and the values of a and b are NOT NaN (but may be Inf). After t = a - b, do I necessarily have a == b + t?
Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY.
What may be more surprising is that there are cases where this happens without any overflow. Basically, they're all of the form where a-b is inexact. For example, if a is DBL_EPSILON/4 and b is -1, a-b is 1 (assuming default rounding mode), and a-b+b is then 0.
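The reason I mention this second example is that this is the canonical way of forcing rounding to a particular precision in IEEE arithmetic. For instance, if you have a number in the range [0,1) and want to force rounding it to a fixed number of fractional bits, you would add and then subtract a large power of two such as 0x1p49. Doubles in [2^49, 2^50) are spaced 2^-3 apart, so with 0x1p49 the number is rounded to a multiple of 1/8; using 0x1p48 instead would keep 4 fractional bits.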
In the process of doing the first operation, bits could have been lost off the low end of the result. So one question is, will the second operation exactly reproduce those losses? I haven't fully thought that out.
But, of course, the first operation could have overflowed to +/-infinity, in which case the final comparison a == b + t fails.
(And, of course, in the general case using == for floating-point values is almost always a bug.)
You are not guaranteed an exact result when using floats. If the two operands have different exponents, the result of an arithmetic operation may need more significand bits than a float has, so it gets rounded.
Consider this code:
float a = 0.003f;
float b = 10000000.0f;
float t = a - b;
float x = b + t;
Running on Visual Studio 2010, you get t==-10000000.0f, and therefore x==0.
You should never use equality when comparing floats. Instead, compare the absolute value of the difference between the two values against an epsilon small enough for your precision needs.
It gets even weirder as different floating point implementations may return different results for the same operation.
Source: https://stackoverflow.com/questions/10791894/ieee-std-754-floating-point-let-t-a-b-does-the-standard-guarantee-that-a