Because the absolute precision of a floating-point value decreases as its magnitude grows, it can be useful to quantize a value relative to its size instead of by a fixed absolute step.
The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.
If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.
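When b is only known at run time, one way to produce 2^b is the standard C function ldexp, which scales by a power of two exactly. This is a minimal sketch rather than part of the original code: the helper name MakeScale and its range check are assumptions made for illustration.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper (not from the original answer): returns 2^b as a
   double, suitable for use as Scale.
   Requiring 0 < b < DBL_MANT_DIG keeps Scale + 1 exactly representable
   and leaves at least one bit for the high part. */
static double MakeScale(int b)
{
    assert(0 < b && b < DBL_MANT_DIG);
    return ldexp(1.0, b);   /* 1.0 * 2^b, computed exactly */
}
```

For example, MakeScale(43) yields the same value as the constant 0x1p43.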
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.
This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated (double for double, float for float, and so on). If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.
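The widest-type workaround could look roughly like the sketch below for a float input. The name SplitFloat and the choice of long double as the widest type are assumptions for illustration. It also assumes long double is an ordinary binary format whose arithmetic is correctly rounded; a double-double long double would not qualify. "Adjusting Scale" here means dropping the extra bits of the wider significand as well, so the exponent grows by LDBL_MANT_DIG - FLT_MANT_DIG.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical helper: splits a float so that *x0 receives the high
   FLT_MANT_DIG - b bits of x, doing all arithmetic in long double. */
static void SplitFloat(float *x0, float *x1, float x, int b)
{
    assert(0 < b && b < FLT_MANT_DIG);

    /* Dropping b bits of a float significand corresponds to dropping
       this many bits of the wider long double significand. */
    int wide_b = b + (LDBL_MANT_DIG - FLT_MANT_DIG);
    long double scale = ldexpl(1.0L, wide_b);   /* 2^wide_b, exact */

    long double wx = x;                 /* float -> long double is exact */
    long double d  = wx * (scale + 1);
    long double t  = d - wx;
    long double hi = d - t;

    *x0 = (float)hi;                    /* has at most FLT_MANT_DIG - b bits */
    *x1 = (float)(wx - hi);             /* remaining low part */
}
```

Calling SplitFloat(&hi, &lo, 1000.3f, 14) keeps 10 significand bits, so hi is 1000 and lo holds the leftover fraction.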
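    /* Scale must be 2^b; for example, use 0x1p43 for b = 43. */
    void Split(double *x0, double *x1, double x)
    {
        double d = x * (Scale + 1);   /* must not overflow */
        double t = d - x;
        *x0 = d - t;                  /* high s-b bits of x */
        *x1 = x - *x0;                /* remaining bits; omit if not needed */
    }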
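To tie this back to quantizing by size, here is a small sketch that keeps only the high part and never computes the low part. The function name QuantizeRelative and the choice to pass the scale as a parameter are assumptions for illustration. It also assumes the compiler evaluates double expressions in double and does not fuse the multiply and subtract into a single FMA, since either would change the rounding the algorithm relies on.

```c
#include <assert.h>

/* Hypothetical wrapper: returns x reduced to its high significand bits.
   scale must be 2^b; the result keeps s-b of the s significand bits,
   so the quantization step grows with the magnitude of x. */
static double QuantizeRelative(double x, double scale)
{
    assert(scale >= 2.0);
    double d = x * (scale + 1);   /* assumes no overflow */
    double t = d - x;
    return d - t;                 /* high part only */
}
```

With scale = 0x1p43, ten bits remain: 1000.3 becomes 1000 (step 1), while 1000300 becomes 1000448 (step 1024).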