IEEE Std 754 Floating-Point: let t := a - b, does the standard guarantee that a == b + t?

Submitted by 十年热恋 on 2019-12-05 01:37:42

Absolutely not. One obvious case is a=DBL_MAX, b=-DBL_MAX. Then t=INFINITY, so b+t is also INFINITY.

What may be more surprising is that there are cases where this happens without any overflow. They all require a-b to be inexact: if the subtraction is exact, adding b back recovers a exactly. For example, if a is DBL_EPSILON/4 and b is -1, a-b rounds to 1 (assuming default rounding mode), and (a-b)+b is then 0, which is not equal to a.
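Both counterexamples can be checked with a small C sketch. The helper name round_trips is made up for illustration, and the volatile qualifiers are an assumption on my part: they force each intermediate to be stored as a double, which matters on targets that evaluate expressions in extended precision.

```c
#include <float.h>
#include <stdbool.h>

/* Hypothetical helper (not from the question): reports whether
 * a == b + (a - b) holds for the given doubles. */
bool round_trips(double a, double b)
{
    /* volatile forces each intermediate to be rounded to double,
     * so excess precision (e.g. x87) cannot hide the rounding. */
    volatile double t = a - b;
    volatile double back = b + t;
    return a == back;
}
```

With the default round-to-nearest mode, round_trips(DBL_MAX, -DBL_MAX) and round_trips(DBL_EPSILON / 4, -1.0) should both return false, while a pair whose difference is exact, such as round_trips(3.0, 1.0), returns true.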

The reason I mention this second example is that this is the canonical way of forcing rounding to a particular precision in IEEE arithmetic. For instance, if you have a double in the range [0,1) and want to force rounding it to a multiple of 2^-3, you would add and then subtract 0x1p49: doubles in [2^49, 2^50) are spaced 2^-3 apart, so the addition rounds away everything below that.
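Here is a minimal sketch of that add-then-subtract trick. The function name round_to_eighths is hypothetical, and the sketch assumes IEEE double, the default round-to-nearest mode, and an input in [0,1). Adding 0x1p49 moves the sum into [2^49, 2^50), where adjacent doubles are 0.125 apart, so the result comes back as a multiple of 0.125.

```c
/* Hypothetical helper: rounds x in [0,1) to the nearest multiple
 * of 2^-3 by adding and then subtracting 2^49. */
double round_to_eighths(double x)
{
    /* The sum lies in [2^49, 2^50), where the spacing between
     * doubles is 2^-3, so the low bits of x are rounded away here.
     * volatile keeps the compiler from simplifying (x + c) - c. */
    volatile double shifted = x + 0x1p49;
    /* The subtraction is exact and leaves the rounded value. */
    volatile double rounded = shifted - 0x1p49;
    return rounded;
}
```

For example, round_to_eighths(0.3) gives 0.25 and round_to_eighths(0.95) gives 1.0. Note that an optimizer running with value-unsafe flags (fast-math style options) may remove the pair of operations entirely, which is part of why the sketch uses volatile.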

In the process of doing the first operation, bits could have been lost off the low end of the result. So one question is, will the second operation exactly reproduce those losses? I haven't fully thought that out.

But, of course, the first operation could have overflowed to +/-infinity, in which case b + t is also infinite and the comparison a == b + t is false.

(And, of course, in the general case using == for floating-point values is almost always a bug.)

Floats do not guarantee this. If the two numbers have different exponents, the exact result of an arithmetic operation may need more significand bits than a float has, so it is rounded to the nearest representable value.

Consider this code:

float a = 0.003f;
float b = 10000000.0f;
float t = a - b;
float x = b + t;

Running on Visual Studio 2010, you get t==-10000000.0f, and therefore x==0.
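The reason t comes out as exactly -10000000.0f is that floats near 10000000 are spaced 1 apart (a float has a 24-bit significand, and 10000000 lies between 2^23 and 2^24), so a change of 0.003 is rounded away. A small sketch to probe this, with a made-up helper name absorbs and assuming IEEE single precision with round-to-nearest:

```c
#include <stdbool.h>

/* Hypothetical helper: true when adding `small` to `big` leaves
 * `big` unchanged after rounding to float. */
bool absorbs(float big, float small)
{
    /* volatile forces the sum to be rounded to float even if the
     * compiler evaluates the expression in a wider type. */
    volatile float sum = big + small;
    return sum == big;
}
```

Under those assumptions absorbs(10000000.0f, -0.003f) is true, which is the subtraction from the example, while absorbs(10000000.0f, 1.0f) is false because 10000001 is still representable.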

You should never use equality when comparing floats. Instead compare the absolute value of the difference between both values and an epsilon value small enough for your precision needs.
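A minimal sketch of such a comparison follows. The name nearly_equal and the eps parameter are illustrative, not from any library, and the choice of eps is left to the caller.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: true when x and y differ by at most eps.
 * eps is an absolute tolerance chosen by the caller. */
bool nearly_equal(double x, double y, double eps)
{
    /* If either value is NaN, or both are the same infinity, the
     * difference is NaN and the comparison yields false. */
    return fabs(x - y) <= eps;
}
```

For instance, nearly_equal(0.1 + 0.2, 0.3, 1e-12) is true even though the two sides differ in their last bit. An absolute tolerance like this does not scale with the magnitude of the operands, so for very large or very small values a tolerance relative to the operands may be a better fit.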

It gets even weirder as different floating point implementations may return different results for the same operation.
