What is a good way to round double-precision values to a (somewhat) lower precision?

Submitted by 痴心易碎 on 2019-12-05 05:05:33

If you take a look at the double bit layout, you can see how to combine it with a bit of bitwise magic to implement fast (binary) rounding to arbitrary precision. You have the following bit layout:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFFFFFFFF

where S is the sign bit, the Es are exponent bits, and the Fs are fraction bits. You can make a bitmask like this:

11111111111111111111111.......1111000000

and bitwise-and (&) the two together. The result is a rounded version of the original input:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFF000000

You can control how much data is chopped off by changing the number of trailing zeros in the mask. More zeros = more rounding; fewer = less. You also get the other effect that you want: small input values are affected proportionally less than large input values, since the "place" each bit corresponds to is determined by the exponent.

Hope that helps!

Caveat: This is technically truncation rather than true rounding (all values will become closer to zero, regardless of how close they are to the other possible result), but hopefully it's just as useful in your case.
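
A minimal sketch of the masking idea, assuming double is an IEEE 754 binary64 value (1 sign bit, 11 exponent bits, 52 fraction bits). The function names and the drop_bits parameter are made up for illustration. The second function is a common variation, not part of the answer above: it adds half a step before masking so that the result is rounded to the nearest step instead of truncated.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Clears the lowest `drop_bits` fraction bits (truncation toward zero).
double truncate_low_bits(double value, int drop_bits) {
    assert(drop_bits >= 0 && drop_bits <= 52);  // binary64 has 52 fraction bits
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);    // copy out the raw bit pattern
    bits &= ~((std::uint64_t{1} << drop_bits) - 1);  // mask: ones, then drop_bits zeros
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Variation: round to the nearest step (ties away from zero) before masking.
double round_low_bits(double value, int drop_bits) {
    assert(drop_bits >= 0 && drop_bits <= 52);
    if (drop_bits == 0 || !std::isfinite(value)) return value;
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    bits += std::uint64_t{1} << (drop_bits - 1);     // add half a step; a carry may bump the exponent
    bits &= ~((std::uint64_t{1} << drop_bits) - 1);  // then clear the low bits as before
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

With drop_bits = 51 only one fraction bit survives, so truncate_low_bits(1.75, 51) gives 1.5 while round_low_bits(1.75, 51) gives 2.0.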

Thanks for the input so far.

However, after some more searching, I came across the frexp() and ldexp() functions! These functions give me access to the "mantissa" and "exponent" of a given double value and can also convert back from mantissa + exponent to a double. Now I just need to round the mantissa.

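// frexp(), ldexp() and round() are declared in <cmath>
double value = original_input();
static const double FACTOR = 32.0;  // the mantissa is rounded to multiples of 1/FACTOR
int exponent;
// frexp() splits value into a mantissa in [0.5, 1) and a power-of-two exponent
double temp = round(frexp(value, &exponent) * FACTOR);
value = ldexp(temp / FACTOR, exponent);  // recombine rounded mantissa and exponent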

I don't know if this is efficient at all, but it gives reasonable results:

0.000010000000000   0.000009765625000
0.000010100000000   0.000010375976563
0.000010200000000   0.000010375976563
0.000010300000000   0.000010375976563
0.000010400000000   0.000010375976563
0.000010500000000   0.000010375976563
0.000010600000000   0.000010375976563
0.000010700000000   0.000010986328125
0.000010800000000   0.000010986328125
0.000010900000000   0.000010986328125
0.000011000000000   0.000010986328125
0.000011100000000   0.000010986328125
0.000011200000000   0.000010986328125
0.000011300000000   0.000011596679688
0.000011400000000   0.000011596679688
0.000011500000000   0.000011596679688
0.000011600000000   0.000011596679688
0.000011700000000   0.000011596679688
0.000011800000000   0.000011596679688
0.000011900000000   0.000011596679688
0.000012000000000   0.000012207031250
0.000012100000000   0.000012207031250
0.000012200000000   0.000012207031250
0.000012300000000   0.000012207031250
0.000012400000000   0.000012207031250
0.000012500000000   0.000012207031250
0.000012600000000   0.000012817382813
0.000012700000000   0.000012817382813
0.000012800000000   0.000012817382813
0.000012900000000   0.000012817382813
0.000013000000000   0.000012817382813
0.000013100000000   0.000012817382813
0.000013200000000   0.000013427734375
0.000013300000000   0.000013427734375
0.000013400000000   0.000013427734375
0.000013500000000   0.000013427734375
0.000013600000000   0.000013427734375
0.000013700000000   0.000013427734375
0.000013800000000   0.000014038085938
0.000013900000000   0.000014038085938
0.000014000000000   0.000014038085938
0.000014100000000   0.000014038085938
0.000014200000000   0.000014038085938
0.000014300000000   0.000014038085938
0.000014400000000   0.000014648437500
0.000014500000000   0.000014648437500
0.000014600000000   0.000014648437500
0.000014700000000   0.000014648437500
0.000014800000000   0.000014648437500
0.000014900000000   0.000014648437500
0.000015000000000   0.000015258789063
0.000015100000000   0.000015258789063
0.000015200000000   0.000015258789063
0.000015300000000   0.000015869140625
0.000015400000000   0.000015869140625
0.000015500000000   0.000015869140625
0.000015600000000   0.000015869140625
0.000015700000000   0.000015869140625
0.000015800000000   0.000015869140625
0.000015900000000   0.000015869140625
0.000016000000000   0.000015869140625
0.000016100000000   0.000015869140625
0.000016200000000   0.000015869140625
0.000016300000000   0.000015869140625
0.000016400000000   0.000015869140625
0.000016500000000   0.000017089843750
0.000016600000000   0.000017089843750
0.000016700000000   0.000017089843750
0.000016800000000   0.000017089843750
0.000016900000000   0.000017089843750
0.000017000000000   0.000017089843750
0.000017100000000   0.000017089843750
0.000017200000000   0.000017089843750
0.000017300000000   0.000017089843750
0.000017400000000   0.000017089843750
0.000017500000000   0.000017089843750
0.000017600000000   0.000017089843750
0.000017700000000   0.000017089843750
0.000017800000000   0.000018310546875
0.000017900000000   0.000018310546875
0.000018000000000   0.000018310546875
0.000018100000000   0.000018310546875
0.000018200000000   0.000018310546875
0.000018300000000   0.000018310546875
0.000018400000000   0.000018310546875
0.000018500000000   0.000018310546875
0.000018600000000   0.000018310546875
0.000018700000000   0.000018310546875
0.000018800000000   0.000018310546875
0.000018900000000   0.000018310546875
0.000019000000000   0.000019531250000
0.000019100000000   0.000019531250000
0.000019200000000   0.000019531250000
0.000019300000000   0.000019531250000
0.000019400000000   0.000019531250000
0.000019500000000   0.000019531250000
0.000019600000000   0.000019531250000
0.000019700000000   0.000019531250000
0.000019800000000   0.000019531250000
0.000019900000000   0.000019531250000
0.000020000000000   0.000019531250000
0.000020100000000   0.000019531250000

This seems to be what I was looking for after all:

http://img833.imageshack.us/img833/9055/clipboard09.png

Now I just need to find a good FACTOR value for my function...

Any comments or suggestions?
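
One way to think about FACTOR, sketched under the assumption that a power of two is acceptable: frexp() returns a mantissa in [0.5, 1), so FACTOR = 2^k keeps roughly k significant binary digits. The names round_mantissa and factor_for_bits are hypothetical wrappers around the snippet above, not existing library functions. As a side note, the sample output listed above appears to match FACTOR = 25 rather than 32 (0.00001 becomes 0.64 * 2^-16 = 0.000009765625), so treat the exact numbers as illustrative.

```cpp
#include <cassert>
#include <cmath>

// Rounds the mantissa of `value` to the nearest multiple of 1/factor.
double round_mantissa(double value, double factor) {
    assert(factor > 0.0);
    int exponent = 0;
    // value == mantissa * 2^exponent, with 0.5 <= |mantissa| < 1 (or mantissa == 0)
    const double mantissa = std::frexp(value, &exponent);
    const double rounded = std::round(mantissa * factor) / factor;
    return std::ldexp(rounded, exponent);
}

// Hypothetical helper: the factor that keeps `bits` significant binary digits.
double factor_for_bits(int bits) {
    assert(bits >= 1 && bits <= 53);
    return std::ldexp(1.0, bits);  // 2^bits
}
```

For example, round_mantissa(x, factor_for_bits(5)) behaves like the snippet above with FACTOR = 32.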

The business scenario is not evident from the question; still, I suspect you are trying to check whether the values are within an acceptable range of each other. Rather than using ==, you can check whether the second value is within a certain percentage of the first (say +/- 0.001%).

If the percentage cannot be fixed (that is, it varies with the required precision; say, 0.001 percent is fine for 2 decimal places but 0.000001 percent is needed for 4), then you can derive it as 1/mantissa.
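
A minimal sketch of such a tolerance check. The name nearly_equal and the choice to scale by the larger magnitude are assumptions of this example; rel_tol = 1e-5 corresponds to the +/- 0.001% mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// True when a and b differ by at most rel_tol relative to the larger magnitude.
bool nearly_equal(double a, double b, double rel_tol) {
    assert(rel_tol >= 0.0);
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

Note that a purely relative check only treats a value as equal to exactly 0.0 when both values are 0.0, so an additional absolute tolerance may be needed if zeros occur in your data.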

I know that this question is quite old, but I also searched for an approach to round double values to a lower precision. Maybe this answer helps someone out there.

Imagine a floating point number in binary representation. For example 1101.101. The bits 1101 represent the integral part of the number and are weighted with 2^3, 2^2, 2^1, 2^0 from left to right. The bits 101 on the fractional part are weighted with 2^-1, 2^-2, 2^-3, which equals 1/2, 1/4, 1/8.

So what is the decimal error you produce when you cut off your number two bits after the binary point? It is 0.125 in this example, since the third fractional bit is set. If that bit were not set, the error would be 0. So the error is <= 0.125.
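
The 1101.101 example (13.625 in decimal) can be checked numerically. This is only a sketch using plain arithmetic rather than bit operations, and keep_fraction_bits is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Keeps `frac_bits` binary digits after the binary point and drops the rest.
double keep_fraction_bits(double value, int frac_bits) {
    assert(frac_bits >= 0);
    const double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    return std::trunc(value * scale) / scale;         // discard lower-weighted bits
}
```

keep_fraction_bits(13.625, 2) yields 13.5 (binary 1101.10), an error of 0.125, while 13.5 itself is returned unchanged because its third fractional bit is not set.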

Now think in a more general way: if you had an infinitely long mantissa, the sum of the fractional bit weights (1/2 + 1/4 + 1/8 + ...) would converge to 1. In fact, a double only has 52 fraction bits, so the sum is "almost" 1. So cutting off all fractional bits will cause an error of <= 1, which is not really a surprise! (Keep in mind that your integral part also occupies mantissa space! But if you assume a number like 1.5, which is 1.1 in binary representation, your mantissa only stores the part after the binary point.)

Since cutting off all fractional bits causes an error of <= 1, cutting off all but the first bit right of the decimal point causes an error of <= 1/2 because this bit is weighted with 2^-1. Keeping a further bit decreases your error to <= 1/4.

This can be described by a function f(x) = 1/2^(52-x) where x is the number of cut off bits counted from the right side and y = f(x) is an upper bound of your resulting error.

Rounding to two places after the decimal point means "grouping" numbers by common hundredths. This can be done with the above function: 1/100 >= 1/2^(52-x). This means that your resulting error is bounded by a hundredth when cutting off x bits. Solving this inequality for x yields 52-log2(100) >= x, where 52-log2(100) is 45.36. This means that cutting off no more than 45 bits ensures a "precision" of two decimal(!) places after the decimal point.
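
The bound can be evaluated for other precisions as well. A small sketch, assuming a value in [1, 2) so that all 52 mantissa bits are fractional; max_cut_bits is a name invented for this example.

```cpp
#include <cassert>
#include <cmath>

// Largest x with 1/2^(52 - x) <= 1/10^decimal_places, for a value in [1, 2).
int max_cut_bits(int decimal_places) {
    assert(decimal_places >= 0);
    const double needed = decimal_places * std::log2(10.0);  // log2(10^n)
    return static_cast<int>(std::floor(52.0 - needed));
}
```

max_cut_bits(2) gives 45, matching the 45.36 from the text rounded down, and max_cut_bits(3) gives 42.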

In general, your mantissa consists of an integral and a fractional part. Let's call their lengths i and f. Positive exponents describe i. Moreover, 52 = f + i holds. The solution of the above inequality changes to 52-i-log2(10^n) >= x, because once your fractional part is used up, you have to stop cutting off the mantissa! (n is the decimal precision here.)

Applying logarithm rules, you can compute the number of maximum allowed bits to cut off like this:

x = f - (uint16_t) ceil(n / 0.3010299956639812); where the constant represents log10(2). Truncation can then be done with:

mantissa >>= x; mantissa <<= x;

If x is larger than f, remember to only shift by f. Otherwise, you will affect your mantissa's integral part.
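
Putting the pieces together, here is a rough sketch under a few assumptions: double is IEEE 754 binary64, i is taken from the unbiased exponent, and x is clamped to the range 0..52 so that values below 1 or above 2^52 are handled without touching the exponent field. The function name is hypothetical, and subnormal values are simply returned unchanged.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Truncates mantissa bits so that the error stays below 10^-n.
double truncate_to_decimal_places(double value, int n) {
    assert(n >= 0);
    if (!std::isfinite(value) || value == 0.0) return value;
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    const int biased = static_cast<int>((bits >> 52) & 0x7FF);
    if (biased == 0) return value;         // subnormal: left untouched in this sketch
    const int i = biased - 1023;           // unbiased exponent (integral bits for i >= 0)
    const int f = 52 - i;                  // fractional bits, from 52 = f + i
    const int needed = static_cast<int>(std::ceil(n / 0.3010299956639812));
    const int x = std::min(52, std::max(0, f - needed));  // never shift past the mantissa
    const std::uint64_t frac_mask = (std::uint64_t{1} << 52) - 1;
    std::uint64_t mantissa = bits & frac_mask;
    mantissa >>= x;                        // drop the x lowest mantissa bits ...
    mantissa <<= x;                        // ... and shift zeros back in
    bits = (bits & ~frac_mask) | mantissa; // keep sign and exponent as they were
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

For instance, truncate_to_decimal_places(1.2345678, 2) cuts 45 bits and returns 1.234375, which is within a hundredth of the input.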
