Fast float quantize, scaled by precision?

执念已碎 2021-01-15 14:06

Since the absolute precision of a float decreases as its magnitude grows, it may sometimes be useful to quantize a value relative to its magnitude, instead of quantizing to a fixed absolute step.

1 Answer
  •  醉梦人生
    2021-01-15 14:33

    The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.

    If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.

    This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.

    This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated: double for double, float for float, and so on. If the compiler evaluates float expressions in double, this breaks. A workaround is to convert the inputs to the widest floating-point type supported, perform the split in that type (with Scale adjusted correspondingly), and then convert back.
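
The overflow and evaluation-precision assumptions above can be checked before splitting. The following is a minimal sketch in C, assuming IEEE 754 binary64 doubles. The helper name split_is_safe and its parameter b (the exponent in Scale = 2^b) are hypothetical and not part of the original answer.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper (not from the original answer): returns 1 if
   splitting the double x with Scale = 2^b looks safe, 0 otherwise. */
int split_is_safe(double x, int b)
{
    /* b must leave at least one bit in each part. */
    if (b <= 0 || b >= DBL_MANT_DIG)
        return 0;

    /* 0 and 1 both mean double expressions are evaluated as double;
       other values (e.g. 2 on x87, or -1) may carry extra precision. */
    if (FLT_EVAL_METHOD != 0 && FLT_EVAL_METHOD != 1)
        return 0;

    /* Guard against overflow of x * (Scale + 1). The division is
       itself rounded, so treat this as an approximate bound. */
    double scale = ldexp(1.0, b);
    return isfinite(x) && fabs(x) <= DBL_MAX / (scale + 1.0);
}
```

The check is deliberately conservative: it rejects any platform where double arithmetic might carry extra precision, rather than attempting the widest-type workaround.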

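    /* Scale = 2^b. Example value: 0x1p43 keeps the high 53 - 43 = 10 bits of a double. */
    static const double Scale = 0x1p43;

    /* Split x into a high part *x0 and a low part *x1, so that x == *x0 + *x1. */
    void Split(double *x0, double *x1, double x)
    {
        double d = x * (Scale + 1);
        double t = d - x;
        *x0 = d - t;     /* high s-b bits of x */
        *x1 = x - *x0;   /* remaining low bits */
    }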
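
When b is only known at run time, the same steps can be wrapped into a quantizer that returns just the high part. This is a sketch under stated assumptions: IEEE 754 binary64 doubles, round-to-nearest mode, and no overflow in x * (2^b + 1). The name quantize_relative, the use of ldexp to build 2^b, and the volatile qualifiers are additions for illustration, not part of the original code.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper (not from the original answer): round x to its
   (53 - b) most significant bits using the Veltkamp-Dekker split. */
double quantize_relative(double x, int b)
{
    assert(b > 0 && b < 53);

    double scale = ldexp(1.0, b);   /* 2^b, computed at run time */

    /* volatile forces each intermediate to be stored as a double, which
       discourages the compiler from fusing the multiply and subtract
       into one FMA or keeping extra precision. This is an added
       precaution, not part of the original code. */
    volatile double d = x * (scale + 1.0);
    volatile double t = d - x;

    return d - t;                   /* high 53 - b bits of x */
}
```

For example, quantize_relative(3.141592653589793, 43) keeps 10 significant bits and yields 3.140625. Because the step size follows the exponent of x, the relative error stays bounded by about 2^-(53 - b) regardless of magnitude, which is the behavior the question asks for.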
    
