问题
Good morning,
I'm reading two numbers from a FITS file (representing the integer and floating point parts of a single number), converting them to long doubles (128 bit in my machine), and then summing them up.
The result is not as precise as I would expect from using 128-bit floats. Here is the code:
a_int = np.longdouble(read_header_key(fits_file, 'I'))
print "I %.25f" % a_int, type(a_int)
a_float = np.longdouble(read_header_key(fits_file, 'F'))
print "F %.25f" % a_float, a_float.dtype
a = a_int + a_float
print "TOT %.25f" % a, a.dtype
and here's the answer I get:
I 55197.0000000000000000000000000 <type 'numpy.float128'>
F 0.0007660185200000000195833 float128
TOT 55197.0007660185219720005989075 float128
The result departs from what I would expect(55197.0007660185200000000195833) after just 11 decimal digits (16 significant digits in total). I would expect a much better precision from 128bit floats. What am I doing wrong?
This result was reproduced on a Mac machine and on a Linux 32bit machine (in that case, the dtype was float96, but the values were exactly the same)
Thanks in advance for your help!
Matteo
回答1:
The problem lies in your printing of the np.longdouble
. When you format using %f
, Python casts the result to a float (64-bits) before printing.
Here:
>>> a_int = np.longdouble(55197)
>>> a_float = np.longdouble(76601852) / 10**11
>>> b = a_int + a_float
>>> '%.25f' % b
'55197.0007660185219720005989075'
>>> '%.25f' % float(b)
'55197.0007660185219720005989075'
>>> b * 10**18
5.5197000766018519998e+22
Note that on my machine, I only get a bit more precision with longdouble
compared with ordinary double
(20 decimal places instead of 15). So, it may be worth seeing if the Decimal
module might be more suited for your application. Decimal
handles arbitrary-precision decimal floating-point numbers with no loss of precision.
回答2:
My guess is that the %f
modifier constructs a float from your longdouble object and uses that when creating the format string.
>>> import numpy as np
>>> np.longdouble(55197)
55197.0
>>> a = np.longdouble(55197)
>>> b = np.longdouble(0.0007660185200000000195833)
>>> a
55197.0
>>> b
0.00076601852000000001958
>>> a + b
55197.00076601852
>>> type(a+b)
<type 'numpy.float128'>
>>> a + b == 55197.00076601852
False
As a side note, even repr
doesn't print enough digets to reconstruct the object. This is simply because you can't have a float literal which is sufficient to pass to your longdouble
.
来源:https://stackoverflow.com/questions/14876165/sum-of-two-np-longdoubles-yields-big-numerical-error