The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant
If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double
value x
, the next representable value up from x
is always given by x / 0x1.fffffffffffffp-1
(where 0x1.fffffffffffffp-1
is just 1.0 - 0.5 * DBL_EPSILON
spelled out as a hex literal).
So we can get the most significant bit that you ask for simply from:
(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
And of course there are analogous results for float
, assuming IEEE 754 binary32 format and semantics.
In fact, the only normal positive value that this fails for is DBL_MAX
, where the result of the division overflows to infinity.
To show that the division trick works, it's enough to prove it for x
in the range 1.0 <= x < 2.0
; it's easy to show that for any x
in this range, the value of x / 0x1.fffffffffffffp-1 - x
(where /
represents mathematical division in this case) lies in the half-open interval (2^-53, 2^52]
, and it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1
rounds up to the next representable value.
Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1
is always the next representable value down from x
.
Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate floating-Point Summation by Siegfriend M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2⌈log2 |p|⌉):
double ULP(double q)
{
// SmallestPositive is the smallest positive floating-point number.
static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
/* Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
something in [.75 ULP, 1.5 ULP) (even with rounding).
*/
static const double Scale = 0.75 * DBL_EPSILON;
q = fabs(q);
// Handle denormals, and get the lowest normal exponent as a bonus.
if (q < 2*DBL_MIN)
return SmallestPositive;
/* Subtract from q something more than .5 ULP but less than 1.5 ULP. That
must produce q - 1 ULP. Then subtract that from q, and we get 1 ULP.
The significand 1 is of particular interest. We subtract .75 ULP from
q, which is midway between the greatest two floating-point numbers less
than q. Since we round to even, the lesser one is selected, which is
less than q by 1 ULP of q, although 2 ULP of itself.
*/
return q - (q - q * Scale);
}
The fabs
and if
can be replaced with ?:
.
For reference, the 2⌈log2 |p|⌉ algorithm is:
q = p / FLT_EPSILON
L = |(q+p) - q|
if L = 0
L = |p|
For the sake of example, assume the type is float
and let x
be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.
float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;
If we could ensure rounding-down, the initial value of y
should be exactly what we want. However, if the top two bits of x
are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON)
could exceed x
by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.
Written as macros:
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) ((PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)