What is a good way to round double-precision values to a (somewhat) lower precision?

My problem is that I have to use a thrid-party function/algorithm which takes an array of double-precision values as input, but apparently can be sensitive to very small changes in the input data. However for my application I have to get identical results for inputs that are (almost) identical! In particular I have two test input arrays which are identical up to the 5-th position after the decimal point and still I get different results. So what causes the "problem" must be after the 5-th position after the decimal point.

Now my idea was to round the input to a slightly lower precision in order to get identical results from inputs that are very similar, yet not 100% identical. Therefore I am looking for a good/efficient way to round double-precision values to a slightly lower precision. So far I am using this code to round to the 9-th position after the decimal point:

double x = original_input();
x = double(qRound(x * 1000000000.0)) / 1000000000.0;

Here qRound() is the normal double to integer rounding function from Qt. This code works and it indeed resolved my problem with the two "problematic" test sets. But: Is there a more efficient way to this?

Also what bothers me: Rounding to the 9-th position after the decimal point might be reasonable for input data that is in the -100.0 to 100.0 range (as is the case with my current input data). But it may be too much (i,e, too much precision loss) for input data in the -0.001 to 0.001 range, for example. Unfortunately I don't know in what range my input values will be in other cases...

After all, I think what I would need is something like a function which does the following: Cut off, by proper rounding, a given double-precision value X to at most L-N positions after the decimal point, where L is the number of positions after the decimal point that double-precision can store (represent) for the given value; and N is fixed, like 3. It means that for "small" values we would allow more positions after the decimal point than for "large" values. In other words I would like to round the 64-Bit floating-point value to a (somewhat) smaller precision like 60-Bit or 56-Bit and then store it back to a 64-Bit double value.

Does this make sense to you? And if so, can you suggest a way to do this (efficiently) in C++ ???

Thanks in advance!

If you take a look at the double bit layout, you can see how to combine it with a bit of bitwise magic to implement fast (binary) rounding to arbitrary precision. You have the following bit layout:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFFFFFFFF

where S is the sign bit, the Es are exponent bits, and the Fs are fraction bits. You can make a bitmask like this:

11111111111111111111111.......1111000000

and bitwise-and (&) the two together. The result is a rounded version of the original input:

SEEEEEEEEEEEFFFFFFFFFFF.......FFFF000000

And you can control how much data is chopped off by changing the number of trailing zeros. More zeros = more rounding; fewer = less. You also get the other effect that you want: small input values are affected proportionally less that large input values, since what "place" each bit corresponds to is determined by the exponent.

Hope that helps!

Caveat: This is technically truncation rather than true rounding (all values will become closer to zero, regardless of how close they are to the other possible result), but hopefully it's just as useful in your case.

Thanks for the input so far.

However after some more searching, I came across frexp() and ldexp() functions! These functions give me access to the "mantissa" and "exponent" of a given double value and can also convert back from mantissa + exponent to a double. Now I just need to round the mantissa.

double value = original_input();
static const double FACTOR = 32.0;
int exponent;
double temp = double(round(frexp(value, &exponent) * FACTOR));
value = ldexp(temp / FACTOR, exponent);

I don't know if this is efficient at all, but it gives reasonable results:

0.000010000000000   0.000009765625000
0.000010100000000   0.000010375976563
0.000010200000000   0.000010375976563
0.000010300000000   0.000010375976563
0.000010400000000   0.000010375976563
0.000010500000000   0.000010375976563
0.000010600000000   0.000010375976563
0.000010700000000   0.000010986328125
0.000010800000000   0.000010986328125
0.000010900000000   0.000010986328125
0.000011000000000   0.000010986328125
0.000011100000000   0.000010986328125
0.000011200000000   0.000010986328125
0.000011300000000   0.000011596679688
0.000011400000000   0.000011596679688
0.000011500000000   0.000011596679688
0.000011600000000   0.000011596679688
0.000011700000000   0.000011596679688
0.000011800000000   0.000011596679688
0.000011900000000   0.000011596679688
0.000012000000000   0.000012207031250
0.000012100000000   0.000012207031250
0.000012200000000   0.000012207031250
0.000012300000000   0.000012207031250
0.000012400000000   0.000012207031250
0.000012500000000   0.000012207031250
0.000012600000000   0.000012817382813
0.000012700000000   0.000012817382813
0.000012800000000   0.000012817382813
0.000012900000000   0.000012817382813
0.000013000000000   0.000012817382813
0.000013100000000   0.000012817382813
0.000013200000000   0.000013427734375
0.000013300000000   0.000013427734375
0.000013400000000   0.000013427734375
0.000013500000000   0.000013427734375
0.000013600000000   0.000013427734375
0.000013700000000   0.000013427734375
0.000013800000000   0.000014038085938
0.000013900000000   0.000014038085938
0.000014000000000   0.000014038085938
0.000014100000000   0.000014038085938
0.000014200000000   0.000014038085938
0.000014300000000   0.000014038085938
0.000014400000000   0.000014648437500
0.000014500000000   0.000014648437500
0.000014600000000   0.000014648437500
0.000014700000000   0.000014648437500
0.000014800000000   0.000014648437500
0.000014900000000   0.000014648437500
0.000015000000000   0.000015258789063
0.000015100000000   0.000015258789063
0.000015200000000   0.000015258789063
0.000015300000000   0.000015869140625
0.000015400000000   0.000015869140625
0.000015500000000   0.000015869140625
0.000015600000000   0.000015869140625
0.000015700000000   0.000015869140625
0.000015800000000   0.000015869140625
0.000015900000000   0.000015869140625
0.000016000000000   0.000015869140625
0.000016100000000   0.000015869140625
0.000016200000000   0.000015869140625
0.000016300000000   0.000015869140625
0.000016400000000   0.000015869140625
0.000016500000000   0.000017089843750
0.000016600000000   0.000017089843750
0.000016700000000   0.000017089843750
0.000016800000000   0.000017089843750
0.000016900000000   0.000017089843750
0.000017000000000   0.000017089843750
0.000017100000000   0.000017089843750
0.000017200000000   0.000017089843750
0.000017300000000   0.000017089843750
0.000017400000000   0.000017089843750
0.000017500000000   0.000017089843750
0.000017600000000   0.000017089843750
0.000017700000000   0.000017089843750
0.000017800000000   0.000018310546875
0.000017900000000   0.000018310546875
0.000018000000000   0.000018310546875
0.000018100000000   0.000018310546875
0.000018200000000   0.000018310546875
0.000018300000000   0.000018310546875
0.000018400000000   0.000018310546875
0.000018500000000   0.000018310546875
0.000018600000000   0.000018310546875
0.000018700000000   0.000018310546875
0.000018800000000   0.000018310546875
0.000018900000000   0.000018310546875
0.000019000000000   0.000019531250000
0.000019100000000   0.000019531250000
0.000019200000000   0.000019531250000
0.000019300000000   0.000019531250000
0.000019400000000   0.000019531250000
0.000019500000000   0.000019531250000
0.000019600000000   0.000019531250000
0.000019700000000   0.000019531250000
0.000019800000000   0.000019531250000
0.000019900000000   0.000019531250000
0.000020000000000   0.000019531250000
0.000020100000000   0.000019531250000

Seems to like what I was looking for after all:

http://img833.imageshack.us/img833/9055/clipboard09.png

Now I just need to find good FACTOR value for my function....

Any comments or suggestions?

The business scenario is not evident from the question; still I feel you are trying to see the values are within an acceptable range. Rather than ==, you can check if the second value is within a certain % range (say +/- 0.001%)

If the range percentage cannot be fixed (mean, varies based on precision length; say, for 2 decimal places, 0.001 percent is fine but for 4 decimal 0.000001 percent is needed) then, you can arrive it by 1/mantissa.

I know that this question is quite old but I also searched for an approach to round double values to a lower precision. Maybe, this answer helps someone out there.

Imagine a floating point number in binary representation. For example 1101.101. The bits 1101 represent the integral part of the number and are weighted with 2^3, 2^2, 2^1, 2^0 from left to right. The bits 101 on the fractional part are weighted with 2^-1, 2^-2, 2^-3, which equals 1/2, 1/4, 1/8.

So what is the decimal error, you produce when you cut off your number two bits after the decimal point? It is 0.125 in this example, since the bit is set. If the bit would not be set, the error is 0. So, the error is <= 0.125.

Now think in a more general way: If you had an infinitely long mantissa, the fractional part converges to 1 (see here). In fact, you only have 52 bits (see here), so the sum is "almost" 1. So cutting off all fractional bits will cause an error of <= 1 which is not really a surprise! (Keep in mind, that your integral part also occupies mantissa space! But if you assume a number like 1.5which is 1.1 in binary representation, your mantissa only stores the part after the decimal point.)

Since cutting off all fractional bits causes an error of <= 1, cutting off all but the first bit right of the decimal point causes an error of <= 1/2 because this bit is weighted with 2^-1. Keeping a further bit decreases your error to <= 1/4.

This can be described by a function f(x) = 1/2^(52-x) where x is the number of cut off bits counted from the right side and y = f(x) is an upper bound of your resulting error.

Rounding by two places after the decimal point means "grouping" numbers by common hundredths. This can be done with the above function:1/100 >= 1/2^(52-x). This means that your resulting error is bounded by a hundredth when cutting off x bits. Solving this inequation by x yields: 52-log2(100) >= x where 52-log2(100) is 45.36. This means that cutting off not more than 45 bits ensures a "precision" of two decimal(!) positions after the floating point.

In general, your mantissa consists of an integral and a fractional part. Let's call their lengths i and f. Positive exponents describe i. Moreover 52=f+i holds. The solution of the above inequation changes to: 52-i-log2(10^n) >= x because after your fractional part is over, you have to stop cutting off the mantissa! (n is the decimal precision here.)

Applying logarithm rules, you can compute the number of maximum allowed bits to cut off like this:

x = f - (uint16_t) ceil(n / 0.3010299956639812); where the constant represents log10(2). Truncation can then be done with:

mantissa >>= x; mantissa <<= x;

If x is larger than f, remember to only shift by f. Otherwise, you will affect your mantissa's integral part.

来源：https://stackoverflow.com/questions/14150136/what-is-a-good-way-to-round-double-precision-values-to-a-somewhat-lower-precis

标签

c++

rounding

floating-point-precision