Floating point comparison revisited

Posted by 自闭症网瘾萝莉.ら on 2019-11-27 10:48:47

“Almost Equals” Is Not a Good Function

4 is not an appropriate value: The answer you point to states “Therefore, 4 should be enough for ordinary use” but contains no basis for that claim. In fact, there are ordinary situations in which numbers calculated in floating-point by different means may differ by many ULP even though they would be equal if calculated by exact mathematics. Therefore, there should be no default value for the tolerance; each user should be required to supply their own, hopefully based on thorough analysis of their code.

As an example of why a default of 4 ULP is bad, consider 1./49*49-1. The mathematically exact result is 0, but the computed result (64-bit IEEE 754 binary) is -0x1p-53, an error exceeding 1e307 ULP of the exact result and almost 1e16 ULP of the computed result.
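The size of that error is easy to check. The sketch below is only an illustration and is not part of the code under review; the helper name residual49 is made up here, and IEEE 754 binary64 doubles are assumed. The intermediates are stored in volatile variables so that each step is rounded to double instead of being fused into a single multiply-subtract.

```cpp
#include <cmath>

// Evaluates 1./49*49-1 one operation at a time in double.
// volatile forces every intermediate to be rounded to double and
// keeps the compiler from fusing the multiply and the subtract.
double residual49() {
    volatile double q = 1.0 / 49;  // quotient, rounded to double
    volatile double p = q * 49;    // product, rounded again; lands just below 1
    return p - 1.0;                // exact mathematics would give 0
}
```

On such a platform residual49() returns -0x1p-53 (about -1.11e-16) rather than 0, so no small ULP tolerance relative to the exact result can accept it.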

Sometimes, no value is appropriate: In some cases, the tolerance cannot be relative to the values being compared, whether as a mathematically exact relative tolerance or as a quantized ULP tolerance. For example, nearly every output value in an FFT is affected by nearly every input value, and the error in any one element is related to the magnitude of other elements. An “almost equals” routine must be supplied additional context with information about the potential error.

“Almost Equals” has poor mathematical properties: One shortcoming of “almost equals” is that scaling changes the results. The code below prints 1 and 0.

double x0 = 1.1;
double x1 = 1.1 + 3*0x1p-52;
std::cout << almostEqual(x0, x1) << "\n";
x0 *= .8;
x1 *= .8;
std::cout << almostEqual(x0, x1) << "\n";

Another failing is that it is not transitive; almostEqual(a, b) and almostEqual(b, c) do not imply almostEqual(a, c).
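Both failings can be reproduced with a simple stand-in. The sketch below does not use the almostEqual under review; ulpDistance, almostEqualUlps and the two demo functions are hypothetical names, and the code assumes positive, finite IEEE 754 binary64 doubles, for which adjacent values have adjacent bit patterns.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical stand-in: distance in ULPs between two positive, finite doubles.
std::uint64_t ulpDistance(double a, double b) {
    std::uint64_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);  // reinterpret the bits without undefined behavior
    std::memcpy(&ib, &b, sizeof ib);
    return ia > ib ? ia - ib : ib - ia;
}

// Hypothetical stand-in comparison with the disputed default of 4 ULP.
bool almostEqualUlps(double a, double b, std::uint64_t maxUlps = 4) {
    return ulpDistance(a, b) <= maxUlps;
}

struct ScalingResult { bool before, after; };

// Same pair of values, compared before and after multiplying both by .8.
ScalingResult scalingDemo() {
    const double eps = std::numeric_limits<double>::epsilon();  // 0x1p-52
    double x0 = 1.1;
    double x1 = 1.1 + 3 * eps;  // exactly 3 ULP above x0
    bool before = almostEqualUlps(x0, x1);
    x0 *= .8;  // both values drop below 1, where an ULP is half as large
    x1 *= .8;
    bool after = almostEqualUlps(x0, x1);
    return {before, after};
}

struct TransitivityResult { bool ab, bc, ac; };

// a and b are 3 ULP apart, b and c are 3 ULP apart, a and c are 6 ULP apart.
TransitivityResult transitivityDemo() {
    const double eps = std::numeric_limits<double>::epsilon();
    double a = 1.0, b = 1.0 + 3 * eps, c = 1.0 + 6 * eps;
    return {almostEqualUlps(a, b), almostEqualUlps(b, c), almostEqualUlps(a, c)};
}
```

With this stand-in, scalingDemo() reports equal before scaling and unequal after, and transitivityDemo() reports a ≈ b and b ≈ c but not a ≈ c.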

A Bug in Extreme Cases

almostEqual(1.f, 1.f/11, 0x745d17) incorrectly returns 1.

1.f/11 is 0x1.745d18p-4. Subtracting this from 1 (which is 0x10p-4) yields 0xe.8ba2e8p-4. Since an ULP of 1 is 0x1p-23, that is 0xe.8ba2e8p19 ULP = 0xe8ba2e.8/2 ULP (shifted 20 bits and divided by 2, netting 19 bits) = 0x745d17.4 ULP. That exceeds the specified tolerance of 0x745d17, so the correct answer would be 0.

This error is caused by rounding in max_frac-scaled_min_frac.
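The arithmetic can be checked by doing the same subtraction once in float and once in double, where it is exact. This is only a sketch: the function names are made up, it does not reproduce the routine's max_frac-scaled_min_frac expression itself, and it assumes float arithmetic is evaluated in float precision (FLT_EVAL_METHOD of 0).

```cpp
#include <cfloat>

// Difference between 1 and 1.f/11 in ULPs of 1, with the subtraction done in float.
double diffUlpsInFloat() {
    float eleventh = 1.f / 11;    // 0x1.745d18p-4
    float diff = 1.f - eleventh;  // needs 25 bits, so it is rounded to 24
    return static_cast<double>(diff) / FLT_EPSILON;  // FLT_EPSILON is one ULP of 1.f
}

// The same difference with the subtraction done in double, where it is exact.
double diffUlpsExact() {
    float eleventh = 1.f / 11;
    double diff = 1.0 - static_cast<double>(eleventh);  // fits easily in 53 bits
    return diff / FLT_EPSILON;
}
```

diffUlpsExact() returns 7626007.25, which is 0x745d17.4 and exceeds the tolerance, while diffUlpsInFloat() returns 7626007 (0x745d17) because the float subtraction hits a tie and rounds to even.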

An easy escape from this problem is to specify that ulps must be less than .5/limits::epsilon. Then rounding occurs in max_frac-scaled_min_frac only if the difference (even when rounded) exceeds ulps; if the difference is less than that, the subtraction is exact, by Sterbenz’ Lemma.

There was a suggestion about using long double to correct this. However, long double would not correct this. Consider comparing 1 and -0x1p-149f with ulps set to 1/limits::epsilon. Unless your significand has 149 bits, the subtraction result rounds to 1, which is less than or equal to 1/limits::epsilon ULP. Yet the mathematical difference clearly exceeds 1.
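A quick way to see this is to perform that subtraction in long double. The function name below is hypothetical, and the sketch assumes float subnormals are not flushed to zero.

```cpp
#include <limits>

// Mathematically, 1 - (-2^-149) is 1 + 2^-149. Common long double formats
// (53, 64 or 113 significand bits) cannot hold that sum, so it rounds to 1.
bool tinyOperandVanishes() {
    long double tiny = -std::numeric_limits<float>::denorm_min();  // -0x1p-149f
    volatile long double diff = 1.0L - tiny;  // volatile: force an actual long double value
    return diff == 1.0L;
}
```

tinyOperandVanishes() returns true on such platforms, so widening the subtraction does not recover the information lost to rounding.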

Minor Note

The expression factor * limits::epsilon / 2 converts factor to the floating-point type, which causes rounding errors for large values of factor that are not exactly representable. Likely, the routine is not intended to be used with such large values (millions of ULPs in float), so this ought to be specified as a limit on the routine rather than a bug.
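For float, the conversion stops being exact just above 2^24. A minimal check, using a made-up helper name:

```cpp
#include <cstdint>

// True if the integer factor is exactly representable as a float,
// i.e. converting it to float does not round.
bool factorSurvivesConversion(std::uint32_t factor) {
    float asFloat = static_cast<float>(factor);
    // Both sides convert to double exactly, so this compares true values.
    return static_cast<double>(asFloat) == static_cast<double>(factor);
}
```

Here factorSurvivesConversion(16777216) is true, but factorSurvivesConversion(16777217) is false because 2^24 + 1 rounds to 2^24.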

Simplification: You could avoid my_frexp altogether by discarding the non-finite cases first:

if( ! std::isfinite(a) || ! std::isfinite(b) )
    return a == b;

std::isfinite is available since C++11 (declared in <cmath>).

EDIT: However, if the intention is to have limits::infinity() within 1 ULP of limits::max(), then the simplification above does not hold. But shouldn't my_frexp() return limits::max_exponent+1 in *exp, rather than max_exponent+2?

FUTURE PROOFING: If you ever want to extend such a comparison to decimal floating point (http://en.wikipedia.org/wiki/Decimal64_floating-point_format) in the future, and assuming that ldexp() and frexp() will handle such a type with the correct radix, then strictly speaking, 0.5 in return std::copysign(0.5, num); should be replaced by T(1)/limits::radix, or std::ldexp(T(1),-1), or something similar. (I could not find a convenient constant in std::numeric_limits.)

EDIT: As Nemo commented, the assumption that ldexp and frexp would use the correct FLT_RADIX is false; they always use radix 2.

So a Future Proof portable version should also use:

  • std::scalbn(x,n) instead of std::ldexp(x,n)

  • exp=std::ilogb(std::abs(x)),y=std::scalbn(x,-exp) instead of y=frexp(x,&exp)

  • now that y above is in [1,FLT_RADIX) instead of [T(1)/FLT_RADIX,1), return copysign(T(1),num) instead of 0.5 for the infinite case of my_frexp, and test for ulps*limits::epsilon() instead of ulps*epsilon()/2

That also requires C++11 or later.
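A minimal sketch of the radix-aware decomposition described in the list above. The names radixFrexp and radixLdexp are hypothetical, and returning zero, infinities and NaN unchanged with *exp set to 0 is an assumption made here; it is not how my_frexp treats those cases.

```cpp
#include <cmath>

// Hypothetical radix-aware replacement for frexp: returns y with |y| in
// [1, FLT_RADIX) and stores e in *exp such that x == y * FLT_RADIX^e.
template <class T>
T radixFrexp(T x, int* exp) {
    if (x == T(0) || !std::isfinite(x)) {  // assumption: pass these through unchanged
        *exp = 0;
        return x;
    }
    *exp = std::ilogb(x);          // exponent of |x| in base FLT_RADIX
    return std::scalbn(x, -*exp);  // exact scaling by a power of the radix
}

// Hypothetical inverse, replacing ldexp.
template <class T>
T radixLdexp(T y, int exp) {
    return std::scalbn(y, exp);
}
```

On a binary platform, radixFrexp(8.0, &e) yields 1.0 with e equal to 3, where frexp would yield 0.5 with an exponent of 4. That shift in the fraction's range is why the tolerance test changes from ulps*epsilon()/2 to ulps*limits::epsilon().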
