Is there a correct constant-expression, in terms of a float, for its msb?

孤城傲影 2021-01-21 14:45

The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant

3 answers
  • 2021-01-21 14:56

    If you can assume IEEE 754 binary64 format and semantics (and in particular that arithmetic operations are correctly rounded), and a round-ties-to-even rounding mode, then it's a nice fact that for any not-too-small not-too-large positive finite double value x, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).

    So we can get the most significant bit that you ask for simply from:

    (x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
    

    And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.

    In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.
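
As a rough sketch, the expression could be wrapped in a macro like this. The names DBL_MSB and FLT_MSB are made up for illustration, and the float variant simply assumes the analogous binary32 constants (1 - 2^-24 and 2^23). It also assumes round-to-nearest and that expressions are evaluated in their own type (FLT_EVAL_METHOD == 0).

```c
/* Hypothetical macro names. Assumes IEEE 754 binary64/binary32,
   round-to-nearest, and FLT_EVAL_METHOD == 0. Not valid for DBL_MAX or
   FLT_MAX, for zero, or for very small inputs. */
#define DBL_MSB(x) (((x) / 0x1.fffffffffffffp-1 - (x)) * 0x1.0p+52)
#define FLT_MSB(x) (((x) / 0x1.fffffep-1f - (x)) * 0x1.0p+23f)

/* Both expand to arithmetic constant expressions, so they can
   initialize objects with static storage duration. */
static const double msb_of_1000 = DBL_MSB(1000.0);  /* 512.0 */
static const float  msb_of_3f   = FLT_MSB(3.0f);    /* 2.0f  */
```

For example, 1000.0 is 1.953125 * 2^9, so its most significant bit has the value 2^9 = 512.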

    To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0. It's easy to show that for any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^-52], that is, strictly more than half an ulp and at most one ulp above x. It follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.
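
The bound can be written out as follows, as a sketch with u = 2^-53 (a symbol introduced here only for brevity), so that 0x1.fffffffffffffp-1 = 1 - u and the largest double below 2.0 is 2(1 - u):

```latex
% u = 2^{-53}; the divisor is 1 - u; x ranges over doubles with 1 <= x <= 2(1 - u).
\[
  \frac{x}{1-u} - x \;=\; \frac{x\,u}{1-u},
  \qquad
  u \;<\; \frac{u}{1-u} \;\le\; \frac{x\,u}{1-u}
    \;\le\; \frac{2(1-u)\,u}{1-u} \;=\; 2u .
\]
```

Since one ulp in [1, 2) is 2u = 2^-52, the exact quotient is more than half an ulp and at most one ulp above x, so rounding to nearest gives x + 2^-52.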

    Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.
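
A minimal sketch of both directions as functions (next_up and next_down are hypothetical names, not standard library functions), under the same assumptions as above:

```c
/* Hypothetical helpers. Assume IEEE 754 binary64, round-to-nearest, and a
   positive normal x that is neither DBL_MAX nor close to the subnormal
   range (for example, next_down is not reliable at DBL_MIN). */
static double next_up(double x)   { return x / 0x1.fffffffffffffp-1; }
static double next_down(double x) { return x * 0x1.fffffffffffffp-1; }
```

On a conforming platform the results should agree with nextafter(x, INFINITY) and nextafter(x, 0.0) from <math.h> for inputs in that range.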

  • 2021-01-21 15:06

    Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate Floating-Point Summation by Siegfried M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2^⌈log2 |p|⌉):

    #include <float.h>   // DBL_EPSILON, DBL_MIN
    #include <math.h>    // fabs
    
    double ULP(double q)
    {
        // SmallestPositive is the smallest positive floating-point number.
        static const double SmallestPositive = DBL_EPSILON * DBL_MIN;
    
        /*  Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
            something in [.75 ULP, 1.5 ULP) (even with rounding).
        */
        static const double Scale = 0.75 * DBL_EPSILON;
    
        q = fabs(q);
    
        // Handle denormals, and get the lowest normal exponent as a bonus.
        if (q < 2*DBL_MIN)
            return SmallestPositive;
    
        /*  Subtract from q something more than .5 ULP but less than 1.5 ULP.  That
            must produce q - 1 ULP.  Then subtract that from q, and we get 1 ULP.
    
            The significand 1 is of particular interest.  We subtract .75 ULP from
            q, which is midway between the greatest two floating-point numbers less
            than q.  Since we round to even, the lesser one is selected, which is
            less than q by 1 ULP of q, although 2 ULP of itself.
        */
        return q - (q - q * Scale);
    }
    

    The fabs and if can be replaced with ?:.
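
One way that replacement might look, as a sketch only: ULP_ABS and ULP_CONST are hypothetical names, and the macro evaluates its argument several times, so it is meant for constant arguments.

```c
#include <float.h>

/* Hypothetical macro names; same assumptions as ULP() above
   (IEEE 754 binary64, round-to-nearest with ties to even). */
#define ULP_ABS(q) ((q) < 0 ? -(q) : (q))
#define ULP_CONST(q)                                               \
    (ULP_ABS(q) < 2 * DBL_MIN                                      \
         ? DBL_EPSILON * DBL_MIN /* smallest positive double */    \
         : ULP_ABS(q) - (ULP_ABS(q) - ULP_ABS(q) * (0.75 * DBL_EPSILON)))

/* Usable where a constant expression is required. */
static const double ulp_of_one = ULP_CONST(1.0);  /* DBL_EPSILON */
```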

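    For reference, the 2^⌈log2 |p|⌉ algorithm is as follows. The divisor is the relative rounding error unit, 2^-24 for float, which is FLT_EPSILON / 2 rather than FLT_EPSILON:

    q = p / (FLT_EPSILON / 2)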
    L = |(q+p) - q|
    if L = 0
        L = |p|
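
    
A possible C rendering of those steps for float, as a sketch. next_power_two is a hypothetical name, and the divisor is taken to be the unit roundoff FLT_EPSILON / 2 = 2^-24, which is an assumption about what the algorithm's epsilon means; with that choice p is always less than one ULP of q, so q + p rounds either to q or to q plus one ULP of q.

```c
#include <float.h>

/* Hypothetical helper: 2^ceil(log2 |p|) for a normal float p, assuming
   IEEE 754 binary32, round-to-nearest-even, and that p / u does not
   overflow. u = 2^-24 is the unit roundoff (an assumption, see above). */
static float next_power_two(float p)
{
    const float u = FLT_EPSILON / 2;
    volatile float q = p / u;   /* volatile forces rounding to float */
    volatile float s = q + p;   /* rounds to q or to q +/- one ULP of q */
    float L = s - q;
    if (L < 0)
        L = -L;
    /* L is 0 only when |p| is already a power of two (the tie rounds to q). */
    return L == 0 ? (p < 0 ? -p : p) : L;
}
```

For instance, next_power_two(5.0f) gives 8.0f, while next_power_two(4.0f) takes the L = 0 branch and returns 4.0f.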
    
  • 2021-01-21 15:17

    For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.

    float y = x*(1+FLT_EPSILON)-x;
    if (y/FLT_EPSILON > x) y/=2;
    

    If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.

    Written as macros:

    #define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
    #define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
    
    #define MSB_VAL(x) (ULP(x)/FLT_EPSILON)
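
    
A small self-contained usage sketch, repeating the macros with balanced parentheses so that it compiles on its own. It assumes IEEE 754 binary32, round-to-nearest, FLT_EVAL_METHOD == 0, and a positive normal argument; msb_of_1000 is a made-up name.

```c
#include <float.h>

/* The macros from this answer, with the parentheses in ULP balanced. */
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)

/* The result is a constant expression, so it can be used in a static
   initializer. 1000.0f is 1.953125 * 2^9, so the expected value is 512. */
static const float msb_of_1000 = MSB_VAL(1000.0f);
```

Here 1000.0f * (1 + FLT_EPSILON) lands two ULPs above 1000.0f, so the halving branch of ULP is the one taken.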
    