How do you print the EXACT value of a floating point number?

自闭症患者 2020-11-27 03:45

First of all, this is not a floating point newbie question. I know results of floating point arithmetic (not to mention transcendental functions) usually cannot be represented …

10 answers
  • 2020-11-27 04:17

    This question has a bureaucratic part and an algorithmic part. A floating point number is stored internally as 2^e × m, where e is an exponent (itself in binary) and m is a mantissa. The bureaucratic part of the question is how to access this data, but R. seems more interested in the algorithmic part of the question, namely, converting 2^e × m to a fraction (a/b) in decimal form. The answer to the bureaucratic question in several languages is frexp (which is an interesting detail that I didn’t know before today).
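
    For the bureaucratic part, here is a minimal C sketch of frexp usage (scaling the returned fraction by 2^53 to get an integer mantissa is my own convention, and it ignores zeros, subnormals, infinities and NaNs):

      #include <math.h>
      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
          double x = 0.1;
          int e;
          double f = frexp(x, &e);            /* x == f * 2^e with 0.5 <= f < 1 */
          int64_t m = (int64_t)ldexp(f, 53);  /* integer mantissa: x == m * 2^(e-53) */
          printf("%g = %lld * 2^%d\n", x, (long long)m, e - 53);
          return 0;
      }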

    It is true that at first glance, it takes O(e^2) work just to write 2^e in decimal, and more time still for the mantissa. But, thanks to the magic of the Schönhage–Strassen fast multiplication algorithm, you can do it in Õ(e) time, where the tilde means “up to log factors”. If you view Schönhage–Strassen as magic, then it’s not that hard to think of what to do. If e is even, you can recursively compute 2^(e/2), and then square it using fast multiplication. On the other hand if e is odd, you can recursively compute 2^(e-1) and then double it. You have to be careful to check that there is a version of Schönhage–Strassen in base 10. Although it is not widely documented, it can be done in any base.
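
    A sketch of that recursion in C, using GMP's mpz_t as the big-integer type (note that GMP works in binary internally, so this only illustrates the divide-and-conquer shape, not the base-10 arithmetic described above; in real code you would just call mpz_ui_pow_ui or mpz_mul_2exp):

      #include <gmp.h>

      /* compute r = 2^e by recursive squaring/doubling */
      static void pow2(mpz_t r, unsigned long e)
      {
          if (e == 0)          mpz_set_ui(r, 1);
          else if (e % 2 == 0) { pow2(r, e / 2); mpz_mul(r, r, r); }     /* square the half-exponent result */
          else                 { pow2(r, e - 1); mpz_mul_ui(r, r, 2); }  /* double for the odd case */
      }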

    Converting a very long mantissa from binary to base 10 is not exactly the same question, but it has a similar answer. You can divide the mantissa into two halves, m = a × 2^k + b. Then recursively convert a and b to base 10, convert 2^k to base 10, and do another fast multiplication to compute m in base 10.

    The abstract result behind all of this is that you can convert integers from one base to another in Õ(N) time.

    If the question is about standard 64-bit floating point numbers, then it’s too small for the fancy Schönhage–Strassen algorithm. In this range you can instead save work with various tricks. One approach is to store all 2048 values of 2^e in a lookup table, and then work in the mantissa with asymmetric multiplication (in between long multiplication and short multiplication). Another trick is to work in base 10000 (or a higher power of 10, depending on architecture) instead of base 10. But, as R. points out in the comments, 128-bit floating point numbers already allow large enough exponents to call into question both lookup tables and standard long multiplication. As a practical matter, long multiplication is the fastest up to a handful of digits, then in a significant medium range one can use Karatsuba multiplication or Toom–Cook multiplication, and then after that a variation of Schönhage–Strassen is best not just in theory but also in practice.

    Actually, the big integer package GMP already has Õ(N)-time radix conversion, as well as good heuristics for choosing among the multiplication algorithms. The only difference between their solution and mine is that instead of doing any big arithmetic in base 10, they compute large powers of 10 in base 2. In this solution, they also need fast division, but that can be obtained from fast multiplication in any of several ways.
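
    For reference, a short C sketch that leans on GMP for the radix conversion (my own example, not GMP documentation): mpf_set_d copies the double's exact binary value, and 1080 fractional digits are enough because a double has at most 1074 fractional bits, so its exact decimal expansion terminates by then (the output is padded with trailing zeros).

      #include <gmp.h>

      int main(void)
      {
          double x = 0.1;
          mpf_t f;
          mpf_init2(f, 2048);           /* more than enough precision to hold a double exactly */
          mpf_set_d(f, x);              /* exact: copies the binary value of x */
          gmp_printf("%.1080Ff\n", f);  /* its exact decimal expansion, plus trailing zeros */
          mpf_clear(f);
          return 0;
      }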

  • 2020-11-27 04:19

    There are three ways:

    1. printing numbers in bin or hex

      This is the most precise way. I prefer hex because it reads more like base 10; for example F.8h = 15.5, with no precision loss here (see also the C sketch after bullet #3).

    2. printing in dec but only the relevant digits

      By this I mean printing only the digits that your number can actually resolve, i.e. just enough digits to represent the value as closely as possible.

      The number of integer digits is easy and exact (no precision loss):

      // n10 - number of base-10 integer digits
      // n2  - number of base-2  integer digits
      // derivation (each line is an equivalent form):
      n10=log10(2^n2)
      n10=log2(2^n2)/log2(10)
      n10=n2/log2(10)
      n10=ceil(n2*0.30102999566398119521373889472449)
      // if first digit is 0 and n10 > 1 then n10--
      

      The number of fractional digits is trickier; empirically I found this:

      // n10 - number of base-10 fractional digits
      // n2  - number of base-2  fractional digits (>= 8)
      n10=0; if (n2==8) n10=1;
      else if (n2==9) n10=2;
      else if (n2> 9)
          {
          n10=((n2-9)%10);
               if (n10>=6) n10=2;
          else if (n10>=1) n10=1;
          n10+=2+(((n2-9)/10)*3);
          }
      

      If you build a dependency table n2 <-> n10 you can see that the constant 0.30102999566398119521373889472449 is still present, but it only starts from 8 bits, because fewer bits cannot represent 0.1 with satisfactory precision (0.85 - 1.15). Because of the negative exponents of base 2 the dependency is not linear; instead it follows a repeating pattern. This code works for small bit counts (<= 52), but at larger bit counts errors can appear because the pattern used does not match log10(2) = 1/log2(10) exactly.

      For larger bit counts I use this:

      n10=7.810+(9.6366363636363636363636*((n2>>5)-1.0));
      

      but that formula is aligned to 32-bit chunks (!!!), and larger bit counts add further error to it.

      P.S. Further analysis of the binary representation of decimal numbers

      0.1
      0.01
      0.001
      0.0001
      ...
      

      may reveal the exact repetition of the pattern, which would give the exact number of relevant digits for any bit count.

      for clarity:

      8 bin digits -> 1 dec digits
      9 bin digits -> 2 dec digits
      10 bin digits -> 3 dec digits
      11 bin digits -> 3 dec digits
      12 bin digits -> 3 dec digits
      13 bin digits -> 3 dec digits
      14 bin digits -> 3 dec digits
      15 bin digits -> 4 dec digits
      16 bin digits -> 4 dec digits
      17 bin digits -> 4 dec digits
      18 bin digits -> 4 dec digits
      19 bin digits -> 5 dec digits
      20 bin digits -> 6 dec digits
      21 bin digits -> 6 dec digits
      22 bin digits -> 6 dec digits
      23 bin digits -> 6 dec digits
      24 bin digits -> 6 dec digits
      25 bin digits -> 7 dec digits
      26 bin digits -> 7 dec digits
      27 bin digits -> 7 dec digits
      28 bin digits -> 7 dec digits
      29 bin digits -> 8 dec digits
      30 bin digits -> 9 dec digits
      31 bin digits -> 9 dec digits
      32 bin digits -> 9 dec digits
      33 bin digits -> 9 dec digits
      34 bin digits -> 9 dec digits
      35 bin digits -> 10 dec digits
      36 bin digits -> 10 dec digits
      37 bin digits -> 10 dec digits
      38 bin digits -> 10 dec digits
      39 bin digits -> 11 dec digits
      40 bin digits -> 12 dec digits
      41 bin digits -> 12 dec digits
      42 bin digits -> 12 dec digits
      43 bin digits -> 12 dec digits
      44 bin digits -> 12 dec digits
      45 bin digits -> 13 dec digits
      46 bin digits -> 13 dec digits
      47 bin digits -> 13 dec digits
      48 bin digits -> 13 dec digits
      49 bin digits -> 14 dec digits
      50 bin digits -> 15 dec digits
      51 bin digits -> 15 dec digits
      52 bin digits -> 15 dec digits
      53 bin digits -> 15 dec digits
      54 bin digits -> 15 dec digits
      55 bin digits -> 16 dec digits
      56 bin digits -> 16 dec digits
      57 bin digits -> 16 dec digits
      58 bin digits -> 16 dec digits
      59 bin digits -> 17 dec digits
      60 bin digits -> 18 dec digits
      61 bin digits -> 18 dec digits
      62 bin digits -> 18 dec digits
      63 bin digits -> 18 dec digits
      64 bin digits -> 18 dec digits
      

      And at last, do not forget to round the cut-off digits!!! That means: if the digit after the last relevant digit is >= 5 (in dec), then the last relevant digit should be incremented by 1... and if it is already 9, you must carry into the previous digit, and so on...

    3. print exact value

      To print the exact value of a fractional binary number, just print n fractional decimal digits, where n is the number of fractional bits, because the value represented is a sum of those bit values, so the number of fractional decimal digits cannot be larger than the number of fractional decimal digits of the LSB. The material above (bullet #2) is relevant for storing a decimal number into a float (or for printing just the relevant decimals). A C sketch of both printing styles follows after the tables below.

      Exact values of negative powers of two:

      2^- 1 = 0.5
      2^- 2 = 0.25
      2^- 3 = 0.125
      2^- 4 = 0.0625
      2^- 5 = 0.03125
      2^- 6 = 0.015625
      2^- 7 = 0.0078125
      2^- 8 = 0.00390625
      2^- 9 = 0.001953125
      2^-10 = 0.0009765625
      2^-11 = 0.00048828125
      2^-12 = 0.000244140625
      2^-13 = 0.0001220703125
      2^-14 = 0.00006103515625
      2^-15 = 0.000030517578125
      2^-16 = 0.0000152587890625
      2^-17 = 0.00000762939453125
      2^-18 = 0.000003814697265625
      2^-19 = 0.0000019073486328125
      2^-20 = 0.00000095367431640625
      

      Now negative powers of 10, printed in exact-value style, for 64-bit doubles:

      10^+ -1 = 0.1000000000000000055511151231257827021181583404541015625
              = 0.0001100110011001100110011001100110011001100110011001101b
      10^+ -2 = 0.01000000000000000020816681711721685132943093776702880859375
              = 0.00000010100011110101110000101000111101011100001010001111011b
      10^+ -3 = 0.001000000000000000020816681711721685132943093776702880859375
              = 0.000000000100000110001001001101110100101111000110101001111111b
      10^+ -4 = 0.000100000000000000004792173602385929598312941379845142364501953125
              = 0.000000000000011010001101101110001011101011000111000100001100101101b
      10^+ -5 = 0.000010000000000000000818030539140313095458623138256371021270751953125
              = 0.000000000000000010100111110001011010110001000111000110110100011110001b
      10^+ -6 = 0.000000999999999999999954748111825886258685613938723690807819366455078125
              = 0.000000000000000000010000110001101111011110100000101101011110110110001101b
      10^+ -7 = 0.0000000999999999999999954748111825886258685613938723690807819366455078125
              = 0.0000000000000000000000011010110101111111001010011010101111001010111101001b
      10^+ -8 = 0.000000010000000000000000209225608301284726753266340892878361046314239501953125
              = 0.000000000000000000000000001010101111001100011101110001000110000100011000011101b
      10^+ -9 = 0.0000000010000000000000000622815914577798564188970686927859787829220294952392578125
              = 0.0000000000000000000000000000010001001011100000101111101000001001101101011010010101b
      10^+-10 = 0.00000000010000000000000000364321973154977415791655470655996396089904010295867919921875
              = 0.00000000000000000000000000000000011011011111001101111111011001110101111011110110111011b
      10^+-11 = 0.00000000000999999999999999939496969281939810930172340963650867706746794283390045166015625
              = 0.00000000000000000000000000000000000010101111111010111111111100001011110010110010010010101b
      10^+-12 = 0.00000000000099999999999999997988664762925561536725284350612952266601496376097202301025390625
              = 0.00000000000000000000000000000000000000010001100101111001100110000001001011011110101000010001b
      10^+-13 = 0.00000000000010000000000000000303737455634003709136034716842278413651001756079494953155517578125
              = 0.00000000000000000000000000000000000000000001110000100101110000100110100001001001011101101000001b
      10^+-14 = 0.000000000000009999999999999999988193093545598986971343290729163921781719182035885751247406005859375
              = 0.000000000000000000000000000000000000000000000010110100001001001101110000110101000010010101110011011b
      10^+-15 = 0.00000000000000100000000000000007770539987666107923830718560119501514549256171449087560176849365234375
              = 0.00000000000000000000000000000000000000000000000001001000000011101011111001111011100111010101100001011b
      10^+-16 = 0.00000000000000009999999999999999790977867240346035618411149408467364363417573258630000054836273193359375
              = 0.00000000000000000000000000000000000000000000000000000111001101001010110010100101111101100010001001101111b
      10^+-17 = 0.0000000000000000100000000000000007154242405462192450852805618492324772617063644020163337700068950653076171875
              = 0.0000000000000000000000000000000000000000000000000000000010111000011101111010101000110010001101101010010010111b
      10^+-18 = 0.00000000000000000100000000000000007154242405462192450852805618492324772617063644020163337700068950653076171875
              = 0.00000000000000000000000000000000000000000000000000000000000100100111001001011101110100011101001001000011101011b
      10^+-19 = 0.000000000000000000099999999999999997524592683526013185572915905567688179926555402943222361500374972820281982421875
              = 0.000000000000000000000000000000000000000000000000000000000000000111011000001111001001010011111011011011010010101011b
      10^+-20 = 0.00000000000000000000999999999999999945153271454209571651729503702787392447107715776066783064379706047475337982177734375
              = 0.00000000000000000000000000000000000000000000000000000000000000000010111100111001010000100001100100100100100001000100011b
      

      And now negative powers of 10, printed with only the relevant decimal digits (the style I am more used to), for 64-bit doubles:

      10^+ -1 = 0.1
      10^+ -2 = 0.01
      10^+ -3 = 0.001
      10^+ -4 = 0.0001
      10^+ -5 = 0.00001
      10^+ -6 = 0.000001
      10^+ -7 = 0.0000001
      10^+ -8 = 0.00000001
      10^+ -9 = 0.000000001
      10^+-10 = 0.0000000001
      10^+-11 = 0.00000000001
      10^+-12 = 0.000000000001
      10^+-13 = 0.0000000000001
      10^+-14 = 0.00000000000001
      10^+-15 = 0.000000000000001
      10^+-16 = 0.0000000000000001
      10^+-17 = 0.00000000000000001
      10^+-18 = 0.000000000000000001
      10^+-19 = 0.0000000000000000001
      10^+-20 = 0.00000000000000000001
      

      hope it helps :)

  • 2020-11-27 04:20

    You don't. The closest you can come to that is dumping the bytes.
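
    For what it's worth, a minimal C sketch of dumping the bytes (it assumes a 64-bit double, which the C standard does not guarantee):

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>

      int main(void)
      {
          double x = 0.1;
          uint64_t bits;
          memcpy(&bits, &x, sizeof bits);   /* well-defined way to reinterpret the bytes */
          printf("0x%016llx\n", (unsigned long long)bits);
          return 0;
      }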

  • 2020-11-27 04:24

    There's been a lot of work on printing floating-point numbers. The gold standard is to print out a decimal equivalent of minimal length such that when the decimal equivalent is read back in, you get the same floating-point number that you started with, no matter what the rounding mode is during readback. You can read about the algorithm in the excellent paper by Burger and Dybvig.
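
    That minimal-length round-trip output is not the same as the exact value, but as a related sketch, portable C can at least guarantee a round-trip by printing DBL_DECIMAL_DIG significant digits (C11; it is 17 for IEEE binary64). The result is not minimal-length in the Burger–Dybvig sense:

      #include <stdio.h>
      #include <float.h>

      int main(void)
      {
          double x = 0.1;
          printf("%.*g\n", DBL_DECIMAL_DIG, x);  /* round-trips, e.g. 0.10000000000000001 */
          return 0;
      }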

  • 2020-11-27 04:28

    Although it's C# and your question is tagged with C, Jon Skeet has code to convert a double to its exact representation as a string: http://www.yoda.arachsys.com/csharp/DoubleConverter.cs

    From a quick glance, it does not appear to be too hard to port to C, and even easier to write in C++.

    Upon further reflection, it appears that Jon's algorithm is also O(e^2), as it also loops over the exponent. However, that means the algorithm is O(log(n)^2) (where n is the floating-point number), and I'm not sure you can convert from base 2 to base 10 in better than log-squared time.

  • 2020-11-27 04:29

    Off the top of my head, why not break the power of 10 down into a sum of powers of two first; then all your operations are lossless.

    I.e.

    10^2 = 2^6 + 2^5 + 2^2
    

    Then sum:

    (mantissa<<6) + (mantissa<<5) + (mantissa<<2)
    

    I'm thinking that breaking it down would be O(n) in the number of digits, the shifting is O(1), and the summing is O(n) digits...

    You would have to have an integer class big enough to store the results, of course...
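
    As a small sanity check of the shift-and-add idea (my own sketch, using a plain uint64_t instead of a big-integer class):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
          uint64_t mantissa = 7205759403792794ULL;  /* the 53-bit mantissa of 0.1 as a double */
          /* 100 = 2^6 + 2^5 + 2^2, so multiplying by 10^2 is three shifts and two adds */
          uint64_t scaled = (mantissa << 6) + (mantissa << 5) + (mantissa << 2);
          printf("%llu * 100 = %llu\n", (unsigned long long)mantissa, (unsigned long long)scaled);
          return 0;
      }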

    Let me know - I'm curious about this, it really made me think. :-)
