问题
Here are two implementations of interpolation functions. Argument u1
is always between 0.
and 1.
.
#include <stdio.h>
double interpol_64(double u1, double u2, double u3)
{
return u2 * (1.0 - u1) + u1 * u3;
}
double interpol_80(double u1, double u2, double u3)
{
return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;
}
int main()
{
double y64,y80,u1,u2,u3;
u1 = 0.025;
u2 = 0.195;
u3 = 0.195;
y64 = interpol_64(u1, u2, u3);
y80 = interpol_80(u1, u2, u3);
printf("u2: %a\ny64:%a\ny80:%a\n", u2, y64, y80);
}
On a strict IEEE 754 platform with 80-bit long double
s, all computations in interpol_64()
are done according to IEEE 754 double precision, and in interpol_80()
in 80-bit extended precision.
The program prints:
u2: 0x1.8f5c28f5c28f6p-3
y64:0x1.8f5c28f5c28f5p-3
y80:0x1.8f5c28f5c28f6p-3
I am interested in the property “the result returned by the function is always in-between u2
and u3
”. This property is false of interpol_64()
, as shown by the values in the main()
above.
Does the property have a chance to be true of interpol_80()
? If it isn't, what is a counter-example? Does it help if we know that u2 != u3
or that there is a minimum distance between them? Is there a method to determine a significand width for intermediate computations at which the property would be guaranteed to be true?
EDIT: on all the random values I tried, the property held when intermediate computations were done in extended precision internally. If interpol_80()
took long double
arguments, it would be relatively easy to build a counter-example, but the question here is specifically about a function that takes double
arguments. This makes it much harder to build a counter-example, if there is one.
Note: a compiler generating x87 instructions may generate the same code for interpol_64()
and interpol_80()
, but this is tangential to my question.
回答1:
Yes, interpol_80() is safe, let's demonstrate it.
The problem states that inputs are 64bits float
rnd64(ui) = ui
The result is exactly (assuming * and + are mathematical operations)
r = u2*(1-u1)+(u1*u3)
Optimal return value rounded to 64 bit float is
r64 = rnd64(r)
As we have these properties
u2 <= r <= u3
It is guaranteed that
rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3
Conversion to 80bits of u1,u2,u3 are exact too.
rnd80(ui)=ui
Now, let's assume 0 <= u2 <= u3
, then performing with inexact float operations leads to at most 4 rounding errors:
rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))
Assuming round to nearest even, this will be at most 2 ULP off exact value. If rounding is performed with 64 bits float or 80 bits floats:
r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)
rf64
can be off by 2 ulp so interpol-64() is unsafe, but what about rnd64( rf80 )
?
We can tell that:
rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))
Since 0 <= u2 <= u3
, then
ulp80(u2) <= ulp80(r) <= ulp80(r3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)
Fortunately, like every number in range (u2-ulp64(u2)/2 , u2+ulp64(u2)/2)
we get
rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3
since ulp80(x)=ulp62(x)/2^(64-53)
We thus get the proof
u2 <= rnd64(rf80) <= u3
For u2 <= u3 <= 0, we can apply same proof easily.
The last case to be studied is u2 <= 0 <= u3. If we subtract 2 big values, then result can be up to ulp(big)/2 off rather than ulp(big-big)/2...
Thus this assertion we made doesn't hold anymore:
r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3
and this is preserved after rounding
u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3
Thus since added quantities are of opposite sign:
u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3
same goes after rounding, so we can once again guaranty
u2 <= rnd64( rf80 ) <= u3
QED
To be complete we should care of denormal inputs (gradual underflow), but I hope you won't be that vicious with stress tests. I won't demonstrate what happens with those...
EDIT:
Here is a follow-up as the following assertion was a bit approximative and generated some comments when 0 <= u2 <= u3
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)
We can write the following inequalities:
rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2
For next rounding operation, we use
ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)
For second part of the sum, we have:
u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2
rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2
I didn't prove the original claim, but not that far...
回答2:
The main source of loss-of-precision in interpol_64
is the multiplications. Multiplying two 53-bit mantissas yields a 105- or 106-bit (depending on whether the high bit carries) mantissa. This is too large to fit in an 80-bit extended precision value, so in general, you'll also have loss-of-precision in the 80-bit version. Quantifying exactly when it happens is very difficult; the most that's easy to say is that it happens when rounding errors accumulate. Note that there's also a small rounding step when adding the two terms.
Most people would probably just solve this problem with a function like:
double interpol_64(double u1, double u2, double u3)
{
return u2 + u1 * (u3 - u2);
}
But it looks like you're looking for insight into the rounding issues, not a better implementation.
来源:https://stackoverflow.com/questions/13725802/properties-of-80-bit-extended-precision-computations-starting-from-double-precis