In various contexts, for example in the argument reduction for mathematical functions, one needs to compute (a - K) / (a + K), where a is a positive variable and K is a constant.
One possibility is to track the error of m and p into m1 and p1 with the classical Dekker/Shewchuk two-sum:
m=a-k;    /* head of a-k */
k0=a-m;   /* reconstruct the k that actually got subtracted */
a0=k0+m;  /* reconstruct the a that actually participated */
k1=k0-k;  /* error contribution from k */
a1=a-a0;  /* error contribution from a */
m1=a1+k1; /* tail of a-k */
p=a+k;    /* head of a+k */
k0=p-a;
a0=p-k0;
k1=k-k0;
a1=a-a0;
p1=a1+k1; /* tail of a+k */
Then, correct the naive division:
q=m/p;             /* first approximation of (a-k)/(a+k) */
r0=fmaf(p,-q,m);   /* residual against the heads: m - q*p */
r1=fmaf(p1,-q,m1); /* residual against the tails: m1 - q*p1 */
r=r0+r1;           /* combined residual */
q1=r/p;            /* correction term */
q=q+q1;            /* corrected quotient */
That'll cost you two divisions, but should be accurate to near half an ulp if I didn't screw up.
But these divisions can be replaced by multiplications with the inverse of p without any problem: the rounding error of the first division is compensated by the remainder r, and a rounding error in the second division does not really matter, since the last bits of the correction q1 won't change anything.
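A minimal sketch of that division-free variant, reusing m, p, m1, p1 from above (the name rinv is introduced here only for illustration):
rinv=1.0f/p;          /* single reciprocal; its rounding error is tolerated */
q=m*rinv;             /* first quotient approximation */
r0=fmaf(p,-q,m);      /* residual against the heads */
r1=fmaf(p1,-q,m1);    /* residual against the tails */
r=r0+r1;              /* combined residual */
q1=r*rinv;            /* correction; last-bit error here is harmless */
q=q+q1;               /* corrected quotient */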
If you can relax the API to return another variable that models the error, then the solution becomes much simpler:
float foo(float a, float k, float *res)
{
    float ret = (a - k) / (a + k);
    *res = fmaf(-ret, a + k, a - k) / (a + k);
    return ret;
}
This solution only handles the truncation error of the division, but does not handle the loss of precision in a+k and a-k.
To handle those errors, I think I need to use double precision, or a bit hack to use fixed point.
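For reference, a minimal sketch of the double-precision route (the function name foo_dp and the residual convention are illustrative, not from this answer):
float foo_dp(float a, float k, float *res)
{
    double num = (double)a - (double)k;  /* far more accurate than float, though not always exact */
    double den = (double)a + (double)k;
    double q   = num / den;              /* one rounding in double precision */
    float  ret = (float)q;               /* round once back to float */
    *res = (float)(q - (double)ret);     /* approximate residual error */
    return ret;
}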
The test code (https://ideone.com/bHxAg8) has been updated to artificially generate non-zero least significant bits in the input.
If a is large compared to K, then (a-K)/(a+K) = 1 - 2K / (a + K) will give a good approximation. If a is small compared to K, then 2a / (a + K) - 1 will give a good approximation. If K/2 ≤ a ≤ 2K, then a-K is an exact operation, so doing the division will give a decent result.
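A hedged sketch of that piecewise selection (the crossover thresholds are illustrative and would need to be validated near the boundaries):
float ratio_piecewise(float a, float K)
{
    if (a > 2.0f * K) {            /* a large compared to K */
        return 1.0f - 2.0f * K / (a + K);
    } else if (2.0f * a < K) {     /* a small compared to K */
        return 2.0f * a / (a + K) - 1.0f;
    } else {                       /* K/2 <= a <= 2K: a - K is exact (Sterbenz) */
        return (a - K) / (a + K);
    }
}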
Since my goal is merely to widen the interval on which accurate results are achieved, rather than to find a solution that works for all possible values of a, making use of double-float arithmetic for all intermediate computation seems too costly.
Thinking some more about the problem, it is clear that the computation of the remainder of the division, e in the code from my question, is the crucial part of achieving a more accurate result. Mathematically, the remainder is (a-K) - q * (a+K). In my code, I simply used m to represent (a-K) and represented (a+K) as m + 2*K, as this delivers numerically superior results compared to the straightforward representation.
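The code from the question is not quoted in this thread; based on the description above, it presumably had roughly the following shape (a reconstruction, not the original):
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;                  /* approximate quotient */
t = fmaf (q, -2.0f*K, m);   /* remainder, with (a+K) represented as m + 2*K */
e = fmaf (q, -m, t);
q = fmaf (r, e, q);         /* apply correction */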
With relatively small additional computational cost, (a+K) can be represented as a double-float, that is, a head-tail pair p:plo, which leads to the following modified version of my original code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 2 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;                 /* approximate quotient */
mx = fmaxf (a, K);
mn = fminf (a, K);
plo = (mx - p) + mn;       /* tail of a+K */
t = fmaf (q, -p, m);       /* residual w.r.t. the head of a+K */
e = fmaf (q, -plo, t);     /* residual w.r.t. the full a+K */
q = fmaf (r, e, q);        /* apply correction */
Testing shows that this delivers nearly correctly rounded results for a in [K/2, 2^24*K), allowing for a substantial increase to the upper bound of the interval on which accurate results are achieved.
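For context, here is a minimal sketch of the kind of exhaustive check that supports such a claim; candidate() is a placeholder introduced here (not from the answer), and comparing against a once-rounded double reference only approximates a true ulp measurement:
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Placeholder for whichever variant is under test; substitute variant 2/3/4/5 here. */
static float candidate (float a, float K)
{
    return (a - K) / (a + K);   /* naive computation as a stand-in */
}

int main (void)
{
    const float K = 1.0f;                        /* illustrative constant */
    unsigned long long mismatches = 0;
    for (uint32_t i = 0x00800000u; i < 0x7f800000u; i++) {  /* all positive finite normals */
        float a;
        memcpy (&a, &i, sizeof a);
        float ref = (float)(((double)a - K) / ((double)a + K));
        if (candidate (a, K) != ref) mismatches++;
    }
    printf ("results differing from the double reference: %llu\n", mismatches);
    return 0;
}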
Widening the interval at the lower end requires the more accurate representation of (a-K). We can compute this as a double-float head-tail pair m:mlo, which leads to the following code variant:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 3 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;                                        /* approximate quotient */
plo = (a < K) ? ((K - p) + a) : ((a - p) + K);    /* tail of a+K */
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);    /* tail of a-K */
t = fmaf (q, -p, m);                              /* residual w.r.t. the head of a+K */
e = fmaf (q, -plo, t);                            /* residual w.r.t. the full a+K */
e = e + mlo;                                      /* fold in the tail of a-K */
q = fmaf (r, e, q);                               /* apply correction */
Exhaustive testing shows that this delivers nearly correctly rounded results for a in the interval [K/2^24, 2^24*K). Unfortunately, this comes at a cost of ten additional operations compared to the code in my question, which is a steep price to pay to get the maximum error from around 1.625 ulps with the naive computation down to near 0.5 ulp.
As in my original code from the question, one can express (a+K) in terms of (a-K), thus eliminating the computation of the tail of p, plo. This approach results in the following code:
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 4 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;                                        /* approximate quotient */
mlo = (a < K) ? (a - (K + m)) : ((a - m) - K);    /* tail of a-K */
t = fmaf (q, -2.0f*K, m);                         /* (a+K) expressed as m + 2*K */
t = fmaf (q, -m, t);                              /* t = m - q*(m + 2*K) */
e = fmaf (q - 1.0f, -mlo, t);                     /* fold in the tail of a-K */
q = fmaf (r, e, q);                               /* apply correction */
This turns out to be advantageous if the main focus is decreasing the lower limit of the interval, which is my particular focus as explained in the question. Exhaustive testing of the single-precision case shows that when K = 2^n nearly correctly rounded results are produced for values of a in the interval [K/2^24, 4.23*K]. With a total of 14 or 15 operations (depending on whether an architecture supports full predication or just conditional moves), this requires seven to eight more operations than my original code.
Lastly, one might base the residual computation directly on the original variable a to avoid the error inherent in the computation of m and p. This leads to the following code that, for K = 2^n, computes nearly correctly rounded results for a in the interval [K/2^24, K/3):
/* Compute q = (a - K) / (a + K) with improved accuracy. Variant 5 */
m = a - K;
p = a + K;
r = 1.0f / p;
q = m * r;                     /* approximate quotient */
t = fmaf (q + 1.0f, -K, a);    /* a - (q+1)*K */
e = fmaf (q, -a, t);           /* residual (a-K) - q*(a+K), based on a directly */
q = fmaf (r, e, q);            /* apply correction */
I don't really have an answer (proper floating point error analyses are very tedious) but a few observations:
- m is computed exactly if a ∈ [0.5×Kb, 2^(1+n)×Kb), where Kb is the power of 2 below K (or K itself if K is a power of 2), and n is the number of trailing zeros in the significand of K (i.e. if K is a power of 2, then n = 23).
- This is similar to the div2 algorithm from Dekker (1971): to expand the range (particularly the lower bound), you'll probably have to incorporate more correction terms from it (i.e. store m as the sum of 2 floats, or use a double).

The problem is the addition in (a + K). Any loss of precision in (a + K) is magnified by the division. The problem isn't the division itself.
If the exponents of a and K are the same, (almost) no precision is lost; and if the absolute difference between the exponents is greater than the significand size, then either (a + K) == a (if a has the larger magnitude) or (a + K) == K (if K has the larger magnitude).
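A quick illustration of that absorption with binary32 values (the constants here are chosen purely for illustration):
float K = 16777216.0f;   /* 2^24 */
float a = 0.25f;         /* exponent gap exceeds the 24-bit significand */
float p = a + K;         /* p == K exactly: the contribution of a is lost */
float m = a - K;         /* likewise m == -K */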
There is no way to prevent this. Increasing the significand size (e.g. using 80-bit "extended double" on 80x86) only helps widen the "accurate result range" slightly. To understand why, consider smallest + largest (where smallest is the smallest positive denormal that a 32-bit floating point number can be). In this case (for 32-bit floats) you'd need a significand size of about 260 bits for the result to avoid precision loss completely. Doing (e.g.) temp = 1/(a + K); result = a * temp - K * temp; won't help much either, because you've still got exactly the same (a + K) problem (but it would avoid a similar problem in (a - K)). Also, you can't do result = anything / p + anything_error / p_error, because division doesn't work like that.
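To first order, (m + m1) / (p + p1) ≈ m/p + m1/p - m*p1/p^2; the fma-based residual corrections shown earlier in the thread effectively compute this expansion as a single residual divided (or multiplied by the reciprocal) once, rather than a separate division of error terms.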
There are only 3 alternatives I can think of to get close to 0.5 ulp for all possible positive values of a that can fit in 32-bit floating point. None are likely to be acceptable.
The first alternative involves pre-computing a lookup table (using "big real number" maths) for every value of a, which (with some tricks) ends up being about 2 GiB for 32-bit floating point (and completely insane for 64-bit floating point). Of course, if the range of possible values of a is smaller than "any positive value that can fit in a 32-bit float", the size of the lookup table would be reduced.
The second alternative is to use something else ("big real number") for the calculation at run-time (and convert to/from 32-bit floating point).
The third alternative involves "something" (I don't know what it's called, but it's expensive). Set the rounding mode to "round to positive infinity" and calculate temp1 = (a + K); if(a < K) temp2 = (a - K); then switch to "round to negative infinity" and calculate if(a >= K) temp2 = (a - K); lower_bound = temp2 / temp1;. Next, do a_lower = a and decrease a_lower by the smallest amount possible, repeat the lower_bound calculation, and keep doing that until you get a different value for lower_bound; then revert back to the previous value of a_lower. After that, do essentially the same (but with opposite rounding modes, and incrementing instead of decrementing) to determine upper_bound and a_upper (starting with the original value of a). Finally, interpolate, like a_range = a_upper - a_lower; result = upper_bound * (a_upper - a) / a_range + lower_bound * (a - a_lower) / a_range;. Note that you will want to calculate an initial upper and lower bound and skip all of this if they're equal. Also be warned that this is all "in theory, completely untested" and I probably borked it somewhere.
Mainly what I'm saying is that (in my opinion) you should give up and accept that there's nothing that you can do to get close to 0.5 ulp. Sorry.. :)