Is there a correct constant-expression, in terms of a float, for its msb?

后端未结


The problem: given a floating point constant expression, can we write a macro that evaluates to a constant expression whose value is a power of two equal to the most significant

3 Answers

暖寄归人

2021-01-21 14:56
If you can assume IEEE 754 binary64 format and semantics (in particular, correctly rounded arithmetic operations) and a round-ties-to-even rounding mode, then there is a nice fact: for any positive finite double value x that is neither too small nor too large, the next representable value up from x is always given by x / 0x1.fffffffffffffp-1 (where 0x1.fffffffffffffp-1 is just 1.0 - 0.5 * DBL_EPSILON spelled out as a hex literal).

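Subtracting x from that quotient leaves one unit in the last place (ULP) of x, and scaling the ULP by 2^52 gives the value of the most significant bit that you ask for: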
```
(x / 0x1.fffffffffffffp-1 - x) * 0x1.0p+52
```
And of course there are analogous results for float, assuming IEEE 754 binary32 format and semantics.

In fact, the only normal positive value that this fails for is DBL_MAX, where the result of the division overflows to infinity.

To show that the division trick works, it's enough to prove it for x in the range 1.0 <= x < 2.0, since (barring overflow and underflow) scaling x by a power of two scales everything else by the same factor. For any x in this range, the value of x / 0x1.fffffffffffffp-1 - x (where / represents mathematical division in this case) lies in the half-open interval (2^-53, 2^-52]. Representable values in [1.0, 2.0] are 2^-52 apart, so it follows that under round-ties-to-even (or in fact any round-to-nearest rounding mode), x / 0x1.fffffffffffffp-1 rounds up to the next representable value.
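
The interval bound can be checked by hand. Writing u = 2^-53 (so the divisor is 1 - u) and taking x to be a double in [1, 2), which means x is at most 2 - 2u, one possible derivation is:

```latex
\[
\frac{x}{1-u} - x = \frac{x\,u}{1-u},
\qquad u = 2^{-53}, \quad 1 \le x \le 2 - 2u = 2(1-u)
\]
\[
2^{-53} = u \;<\; \frac{u}{1-u} \;\le\; \frac{x\,u}{1-u}
\;\le\; \frac{2(1-u)\,u}{1-u} = 2u = 2^{-52}
\]
```

The exact quotient therefore sits more than half a spacing and at most one full spacing above x, so the nearest representable value is x + 2^-52.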

Similarly, under the same assumptions, x * 0x1.fffffffffffffp-1 is always the next representable value down from x.

庸人自扰

2021-01-21 15:06

Here is code for finding the ULP. It was inspired by algorithm 3.5 in Accurate Floating-Point Summation by Siegfried M. Rump, Takeshi Ogita, and Shin’ichi Oishi (which calculates 2^{⌈log₂ |p|⌉}):

```
#include <float.h>   /* DBL_EPSILON, DBL_MIN */
#include <math.h>    /* fabs */

double ULP(double q)
{
    // SmallestPositive is the smallest positive floating-point number.
    static const double SmallestPositive = DBL_EPSILON * DBL_MIN;

    /*  Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
        something in [.75 ULP, 1.5 ULP) (even with rounding).
    */
    static const double Scale = 0.75 * DBL_EPSILON;

    q = fabs(q);

    // Handle denormals, and get the lowest normal exponent as a bonus.
    if (q < 2*DBL_MIN)
        return SmallestPositive;

    /*  Subtract from q something more than .5 ULP but less than 1.5 ULP.  That
        must produce q - 1 ULP.  Then subtract that from q, and we get 1 ULP.

        The significand 1 is of particular interest.  We subtract .75 ULP from
        q, which is midway between the greatest two floating-point numbers less
        than q.  Since we round to even, the lesser one is selected, which is
        less than q by 1 ULP of q, although 2 ULP of itself.
    */
    return q - (q - q * Scale);
}
```

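The fabs call and the if statement can each be replaced with ?:, which turns the whole computation into a single expression, as a constant expression requires.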

For reference, the 2^{⌈log₂ |p|⌉} algorithm is:

```
q = p / eps        (eps is the unit roundoff, FLT_EPSILON / 2 for float)
L = |(q+p) - q|
if L = 0
    L = |p|
```
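
A possible C rendering of these steps for float is sketched below. It assumes that eps denotes the unit roundoff, FLT_EPSILON / 2 = 2^-24; dividing by FLT_EPSILON itself would return 1 rather than 2 for p = 1.25. The function name is hypothetical, and inputs large enough for p / eps to overflow are not handled.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Sketch of the 2^ceil(log2 |p|) steps for float (hypothetical name).
   Assumes round-to-nearest-even. The volatile objects ask the compiler to
   round each intermediate result to float. */
static float next_power_of_two(float p)
{
    const float eps = FLT_EPSILON / 2;   /* assumed meaning of eps: 2^-24 */
    volatile float q = p / eps;          /* exact scaling by 2^24 */
    volatile float sum = q + p;          /* p is rounded away at the ULP of q */
    float L = fabsf(sum - q);
    return L == 0 ? fabsf(p) : L;        /* L == 0 when |p| is a power of two */
}
```

For example, `next_power_of_two(3.0f)` yields 4, while an exact power of two such as 8 is returned unchanged.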


离开以前

2021-01-21 15:17
For the sake of example, assume the type is float and let x be the input. Initially I will write this as a sequence of statements for readability, but they can be translated directly into macros that produce constant expressions.
```
float y = x*(1+FLT_EPSILON)-x;
if (y/FLT_EPSILON > x) y/=2;
```
If we could ensure rounding-down, the initial value of y should be exactly what we want. However, if the top two bits of x are 1 and any lower bits are set, or if we hit a rounds-to-even case, x*(1+FLT_EPSILON) could exceed x by 2 units in the last place instead of just 1. I don't believe any other cases are possible, and I believe the second line accounts fully for this one.

Written as macros:
```
#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))

#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)
```
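
To see the macros acting as constant expressions, they can be used in static initializers. The sketch below repeats the three macros with the parentheses in ULP balanced. The object names and `check_msb_val` are made up for the example, and it assumes float expressions are evaluated in float precision with round-to-nearest-even.

```c
#include <assert.h>
#include <float.h>

#define PRE_ULP(x) ((x)*(1+FLT_EPSILON)-(x))
#define ULP(x) (PRE_ULP(x)/FLT_EPSILON>(x) ? PRE_ULP(x)/2 : PRE_ULP(x))
#define MSB_VAL(x) (ULP(x)/FLT_EPSILON)

/* Static initializers must be constant expressions, so these lines only
   compile if the macros qualify. */
static const float msb_1_0  = MSB_VAL(1.0f);
static const float msb_1_5  = MSB_VAL(1.5f);   /* hits the round-to-even case */
static const float msb_1_75 = MSB_VAL(1.75f);  /* top two bits set, plus a lower bit */
static const float msb_10   = MSB_VAL(10.0f);
static const float msb_0_1  = MSB_VAL(0.1f);

/* Checks the values against powers of two worked out by hand. */
static void check_msb_val(void)
{
    assert(msb_1_0  == 1.0f);
    assert(msb_1_5  == 1.0f);
    assert(msb_1_75 == 1.0f);
    assert(msb_10   == 8.0f);
    assert(msb_0_1  == 0.0625f);
}
```

The inputs 1.5f and 1.75f are the ones where the first step overshoots by 2 units in the last place, so they exercise the halving branch.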