Is there a correct constant-expression, in terms of a float, for its msb?

问题

The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant place of the significand? Equivalently, this is just the greatest power of two less than or equal to the input in magnitude.

For the purposes of this question we can ignore:

Near-overflow or near-underflow values (they can be handled with finitely many applications of ?: to rescale).
Negative inputs (they can be handled likewise).
Non-Annex-F-conforming implementations (can't really do anything useful in floating point with them).
Weirdness around excess precision (float_t and double_t can be used with FLT_EVAL_METHOD and other float.h macros to handle it safely).

So it suffices to solve the problem for positive values bounded away from infinity and the denormal range.

Note that this problem is equivalent to finding the "epsilon" for a specific value, that is, nextafter(x,INF)-x (or the equivalent in float or long double), with the result just scaled by DBL_EPSILON (or equivalent for the type). Solutions that find that are perfectly acceptable if they're simpler.

I have a proposed solution I'm posting as a self-answer, but I'm not sure if it's correct.

回答1:

Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate floating-Point Summation by Siegfriend M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2^{⌈log₂ |p|⌉}):

double ULP(double q)
{
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;

    /*  Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
        something in [.75 ULP, 1.5 ULP) (even with rounding).
    */
    static const double Scale = 0.75 * DBL_EPSILON;

    q = fabs(q);

    // Handle denormals, and get the lowest normal exponent as a bonus.
    if (q < 2*DBL_MIN)
        return SmallestPositive;

    /*  Subtract from q something more than .5 ULP but less than 1.5 ULP.  That
        must produce q - 1 ULP.  Then subtract that from q, and we get 1 ULP.

        The significand 1 is of particular interest.  We subtract .75 ULP from
        q, which is midway between the greatest two floating-point numbers less
        than q.  Since we round to even, the lesser one is selected, which is
        less than q by 1 ULP of q, although 2 ULP of itself.
    */
    return q - (q - q * Scale);
}

The fabs and if can be replaced with ?:.

For reference, the 2^{⌈log₂ |p|⌉} algorithm is:

q = p / FLT_EPSILON
L = |(q+p) - q|
if L = 0
    L = |p|

回答2:

If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double value x, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).

So we can get the most significant bit that you ask for simply from:

(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52

And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.

In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.

To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0; it's easy to show that for any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^52], and it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.

Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.

回答3:

For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.

float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;

If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.

Written as macros:

#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) ((PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))

#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)

来源：https://stackoverflow.com/questions/53598657/is-there-a-correct-constant-expression-in-terms-of-a-float-for-its-msb

标签

floating-point

constant-expression