In various contexts, for example in argument reduction for mathematical functions, one needs to compute `(a - K) / (a + K)`, where `a` is a positive v
The problem is the addition in `(a + K)`. Any loss of precision in `(a + K)` is magnified by the division. The problem isn't the division itself.
If the exponents of `a` and `K` are the same, (almost) no precision is lost. If the absolute difference between the exponents is greater than the significand size, then either `(a + K) == a` (if `a` has the larger magnitude) or `(a + K) == K` (if `K` has the larger magnitude).
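
To make the "one operand swallows the other" case concrete, here is a minimal sketch. The function name `sum_absorbs_operand` is made up for illustration, and it assumes `float` is an IEEE 754 32-bit type:

```c
#include <assert.h>

/* Returns 1 when a + K, rounded to float, is equal to one of the two
   operands, i.e. the smaller operand contributed nothing to the sum. */
int sum_absorbs_operand(float a, float K)
{
    assert(a > 0.0f);               /* the text only considers positive a */
    volatile float sum = a + K;     /* volatile forces rounding to float  */
    return sum == a || sum == K;
}
```

With `a = 1.0f` and `K = 1.0e10f` this should report absorption (the spacing between floats near 1e10 is 1024, so adding 1 changes nothing), while `a = 1.0f` and `K = 3.0f` should not.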
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only widens the "accurate result range" slightly. To understand why, consider `smallest + largest` (where `smallest` is the smallest positive denormal a 32-bit floating point number can be, and `largest` is the largest finite one). In this case (for 32-bit floats) you'd need a significand size of about 277 bits (covering 2^127 down to 2^-149) for the result to avoid precision loss completely. Doing (e.g.) `temp = 1/(a + K); result = a * temp - K * temp;` won't help much either, because you've still got exactly the same `(a + K)` problem (but it would avoid a similar problem in `(a - K)`). Also, you can't do `result = anything / p + anything_error/p_error` because division doesn't work like that.
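
A rough check of that significand-width estimate can be done from the `<float.h>` limits alone. This is only a sketch: the function name is invented here, and it assumes a binary format with denormals (as IEEE 754 32-bit floats have):

```c
#include <assert.h>
#include <float.h>

/* Number of significand bits needed to hold (largest finite float) +
   (smallest positive denormal float) with no rounding at all. */
int exact_sum_bits_float(void)
{
    assert(FLT_RADIX == 2);                         /* binary formats only */
    const int top    = FLT_MAX_EXP - 1;             /* highest bit: 2^127  */
    const int bottom = FLT_MIN_EXP - FLT_MANT_DIG;  /* lowest bit: 2^-149  */
    return top - bottom + 1;
}
```

On an IEEE 754 platform this evaluates to 277. The same arithmetic with the `DBL_` macros gives 2098 for 64-bit doubles, which is why a somewhat wider hardware type doesn't get you there.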
There are only 3 alternatives I can think of to get close to 0.5 ulps for all possible positive values of `a` that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of `a`, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course, if the range of possible values of `a` is smaller than "any positive value that can fit in a 32-bit float", the size of the lookup table would be reduced.
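
For the lookup-table idea, one plausible way to index the table is by the raw bit pattern of `a`. This is a sketch under the assumption that `float` is a 32-bit IEEE 754 type; `table_index` is a hypothetical name, and the size-reducing "tricks" mentioned above are not reproduced here:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Maps a positive float to its raw bit pattern. For positive floats the
   bit patterns are ordered the same way as the values, so this can serve
   directly as an index into a precomputed table. */
uint32_t table_index(float a)
{
    uint32_t bits;
    assert(a > 0.0f);
    assert(sizeof bits == sizeof a);   /* assumes a 32-bit float */
    memcpy(&bits, &a, sizeof bits);    /* well-defined type pun  */
    return bits;
}
```

Positive finite floats occupy the patterns 0x00000001 through 0x7F7FFFFF, so a naive table has a little under 2^31 entries.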
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
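
The shape of the second alternative (convert, compute in something wider, convert back) looks roughly like this. Note the big assumption: `double` is used here only as a stand-in for a real "big real number" type, and it merely widens the accurate range rather than removing the problem. The name `ratio_via_double` is made up for the example:

```c
#include <assert.h>

/* Computes (a - K) / (a + K) in a wider type and rounds once at the end.
   double is a placeholder; a true arbitrary-precision type would be
   needed to cover every possible pair of 32-bit inputs exactly. */
float ratio_via_double(float a, float K)
{
    assert(a > 0.0f);                  /* the text only considers positive a */
    const double wa = a, wK = K;       /* float to double is always exact    */
    return (float)((wa - wK) / (wa + wK));
}
```

For example, `ratio_via_double(3.0f, 1.0f)` gives `0.5f`.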
The third alternative involves "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate `if(a >= K) temp1 = (a + K);`, then switch to "round to negative infinity" and calculate `if(a < K) temp1 = (a + K); temp2 = (a - K); lower_bound = temp2 / temp1;` (for a lower bound the numerator is always rounded down, while the denominator is rounded up only when the numerator is non-negative). Next do `a_lower = a` and decrease `a_lower` by the smallest amount possible (one step to the next representable value), repeat the `lower_bound` calculation, and keep doing that until you get a different value for `lower_bound`, then revert to the previous value of `a_lower`. After that, do essentially the same (but with opposite rounding modes, and incrementing instead of decrementing) to determine `upper_bound` and `a_upper` (starting with the original value of `a`). Finally, interpolate, like `a_range = a_upper - a_lower; result = lower_bound * (a_upper - a) / a_range + upper_bound * (a - a_lower) / a_range;`. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)