Question
I would like to introduce some artificial precision loss into two numbers being compared to smooth out minor rounding errors, so that I don't have to use the Math.abs(x - y) < eps idiom in every comparison involving x and y.
Essentially, I want something that behaves similarly to down-casting a double to a float and then up-casting it back to a double, except I want to also preserve very large and very small exponents and I want some control over the number of significand bits preserved.
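For reference, the float round trip mentioned above can be sketched as follows (illustrative only); it fixes the significand at float's 24 bits and, as noted, cannot preserve exponents outside float's range:
// Illustrative sketch of the double -> float -> double round trip (not the desired solution).
double x = Math.PI;
double viaFloat = (double) (float) x;        // keeps only float's 24-bit significand
System.out.println(x + " -> " + viaFloat);
System.out.println((double) (float) 1e300);  // Infinity: a large exponent does not survive the round trip
System.out.println((double) (float) 1e-300); // 0.0: very small exponents are lost as well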
Given the following function that produces the binary representation of the significand of a 64-bit IEEE 754 number:
public static String significand(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    // Skip the sign and exponent fields and return the 52 stored significand bits.
    return s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}
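For example (a quick illustration, assuming the function above is in scope), values whose significands are easy to read by eye:
// significand(1.0) is 52 zeros, because 1.0 is 1.0 (binary) * 2^0.
System.out.println(significand(1.0));
// significand(1.5) is a 1 followed by 51 zeros, because 1.5 is 1.1 (binary) * 2^0.
System.out.println(significand(1.5));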
I want a function reducePrecision(double x, int bits) that reduces the precision of the significand of a double such that:
significand(reducePrecision(x, bits)).substring(bits).equals(String.format("%0" + (52 - bits) + "d", 0))
In other words, every bit after the bits-most significant bit in the significand of reducePrecision(x, bits) should be 0, while the bits-most significant bits in the significand of reducePrecision(x, bits) should reasonably approximate the bits-most significant bits in the significand of x.
Answer 1:
Suppose x is the number you wish to reduce the precision of and bits is the number of significant bits you wish to retain. When bits is sufficiently large and the order of magnitude of x is sufficiently close to 0, then x * (1L << (bits - Math.getExponent(x))) will scale x so that the bits that need to be removed will appear in the fractional component (after the radix point) while the bits that will be retained will appear in the integer component (before the radix point). You can then round this to remove the fractional component and then divide the rounded number by (1L << (bits - Math.getExponent(x))) to restore the order of magnitude of x, i.e.:
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    // Cast the rounded long back to double so the division restores the fraction instead of truncating it.
    return (double) Math.round(x * (1L << exponent)) / (1L << exponent);
}
However, (1L << exponent) will break down when Math.getExponent(x) > bits || Math.getExponent(x) < bits - 62. The solution is to use Math.pow(2, exponent) (or the fast pow2(exponent) implementation from this answer) to calculate a fractional, or a very large, power of 2, i.e.:
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.round(x * Math.pow(2, exponent)) * Math.pow(2, -exponent);
}
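To see why the shift needed replacing, note that Java masks a long shift distance to its low six bits, so an out-of-range exponent silently yields the wrong power of two. A small, hypothetical demonstration (the input value is arbitrary):
// Hypothetical demonstration of why the shift-based version breaks down.
double x = 1e300;                          // Math.getExponent(1e300) is 996
int exponent = 23 - Math.getExponent(x);   // -973, far outside the 0..62 range a long shift can express
// Java masks a long shift distance to its low 6 bits, so 1L << -973 is effectively 1L << 51.
System.out.println(1L << exponent);        // 2251799813685248, i.e. 2^51 rather than 2^-973
System.out.println(Math.pow(2, exponent)); // on the order of 1e-293, the intended scale factor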
However, Math.pow(2, exponent) will break down as exponent approaches -1074 or +1023. The solution is to use Math.scalb(x, exponent) so that the power of 2 doesn't have to be explicitly calculated, i.e.:
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.round(Math.scalb(x, exponent)), -exponent);
}
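The gain from Math.scalb shows up at the extremes, where the explicit power of two cannot be represented even though the scaled result can. A hedged illustration (Double.MIN_VALUE is chosen only because it is the smallest subnormal):
// Hypothetical demonstration at the extremes.
double x = Double.MIN_VALUE;               // 2^-1074; Math.getExponent reports -1023 for subnormals
int exponent = 23 - Math.getExponent(x);   // 1046
System.out.println(Math.pow(2, exponent));     // Infinity: 2^1046 is not representable as a double
System.out.println(x * Math.pow(2, exponent)); // Infinity, so the pow-based version loses x entirely
System.out.println(Math.scalb(x, exponent));   // 2^-28, a finite value that can be rounded and scaled back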
However, Math.round(y) returns a long, so it does not preserve Infinity, NaN, or cases where Math.abs(x) > Long.MAX_VALUE / Math.pow(2, exponent). Furthermore, Math.round(y) always rounds ties towards positive infinity (e.g. Math.round(0.5) == 1 && Math.round(1.5) == 2). The solution is to use Math.rint(y), which returns a double and preserves the unbiased IEEE 754 round-to-nearest, ties-to-even rule (e.g. Math.rint(0.5) == 0.0 && Math.rint(1.5) == 2.0), i.e.:
public static double reducePrecision(double x, int bits) {
    int exponent = bits - Math.getExponent(x);
    return Math.scalb(Math.rint(Math.scalb(x, exponent)), -exponent);
}
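Before the unit test, here is a short usage sketch (the values are illustrative, not from the test below) of the behaviour that motivated the question: after reducing precision, two results that differ only by rounding error compare equal with ==, and Infinity and NaN pass through untouched:
// Hypothetical usage of the final reducePrecision (values chosen only for illustration).
double x = 0.1 + 0.2;   // 0.30000000000000004, one ulp above the double nearest 0.3
double y = 0.3;
System.out.println(x == y);                                            // false
System.out.println(reducePrecision(x, 30) == reducePrecision(y, 30));  // true for these inputs
// Infinity and NaN pass through untouched because Math.scalb and Math.rint preserve them.
System.out.println(reducePrecision(Double.POSITIVE_INFINITY, 23));     // Infinity
System.out.println(Double.isNaN(reducePrecision(Double.NaN, 23)));     // true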
Finally, here is a unit test confirming our expectations:
public static String decompose(double d) {
    int SIGN_WIDTH = 1;
    int EXP_WIDTH = 11;
    int SIGNIFICAND_WIDTH = 53;
    String s = String.format("%64s", Long.toBinaryString(Double.doubleToRawLongBits(d))).replace(' ', '0');
    return s.substring(0, 0 + SIGN_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH) + " "
            + s.substring(0 + SIGN_WIDTH + EXP_WIDTH, 0 + SIGN_WIDTH + EXP_WIDTH + SIGNIFICAND_WIDTH - 1);
}
public static void test() {
    // Use a fixed seed so the generated numbers are reproducible.
    java.util.Random r = new java.util.Random(0);
    // Generate a floating point number that makes use of its full 52 bits of significand precision.
    double a = r.nextDouble() * 100;
    System.out.println(decompose(a) + " " + a);
    Assert.assertFalse(decompose(a).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // Cast the double to a float to produce a "ground truth" of precision loss to compare against.
    double b = (float) a;
    System.out.println(decompose(b) + " " + b);
    Assert.assertTrue(decompose(b).split(" ")[2].substring(23).equals(String.format("%0" + (52 - 23) + "d", 0)));
    // A 32-bit float stores 23 explicit significand bits, so c's bit pattern should be identical to b's bit pattern.
    double c = reducePrecision(a, 23);
    System.out.println(decompose(c) + " " + c);
    Assert.assertTrue(b == c);
    // 23rd-most significant bit in c is 1, so rounding it to the 22nd-most significant bit requires breaking a tie.
    // Since 22nd-most significant bit in c is 0, d will be rounded down so that its 22nd-most significant bit remains 0.
    double d = reducePrecision(c, 22);
    System.out.println(decompose(d) + " " + d);
    Assert.assertTrue(decompose(d).split(" ")[2].substring(22).equals(String.format("%0" + (52 - 22) + "d", 0)));
    Assert.assertTrue(decompose(c).split(" ")[2].charAt(22) == '1' && decompose(c).split(" ")[2].charAt(21) == '0');
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(21) == '0');
    // 21st-most significant bit in d is 1, so rounding it to the 20th-most significant bit requires breaking a tie.
    // Since 20th-most significant bit in d is 1, e will be rounded up so that its 20th-most significant bit becomes 0.
    double e = reducePrecision(d, 20);
    System.out.println(decompose(e) + " " + e);
    Assert.assertTrue(decompose(e).split(" ")[2].substring(20).equals(String.format("%0" + (52 - 20) + "d", 0)));
    Assert.assertTrue(decompose(d).split(" ")[2].charAt(20) == '1' && decompose(d).split(" ")[2].charAt(19) == '1');
    Assert.assertTrue(decompose(e).split(" ")[2].charAt(19) == '0');
    // Reduce the precision of a number close to the largest normal number.
    double f = reducePrecision(a * 0x1p+1017, 23);
    System.out.println(decompose(f) + " " + f);
    // Reduce the precision of a number close to the smallest normal number.
    double g = reducePrecision(a * 0x1p-1028, 23);
    System.out.println(decompose(g) + " " + g);
    // Reduce the precision of a number close to the smallest subnormal number.
    double h = reducePrecision(a * 0x1p-1051, 23);
    System.out.println(decompose(h) + " " + h);
}
And its output:
0 10000000101 0010010001100011000110011111011100100100111000111011 73.0967787376657
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110100000000000000000000000000000 73.0967788696289
0 10000000101 0010010001100011000110000000000000000000000000000000 73.09677124023438
0 10000000101 0010010001100011001000000000000000000000000000000000 73.0968017578125
0 11111111110 0010010001100011000110100000000000000000000000000000 1.0266060746443803E308
0 00000000001 0010010001100011000110100000000000000000000000000000 2.541339559435826E-308
0 00000000000 0000000000000000000000100000000000000000000000000000 2.652494739E-315
Source: https://stackoverflow.com/questions/48727424/how-do-i-truncate-the-significand-of-a-floating-point-number-to-an-arbitrary-pre