Sum of two “np.longdouble”s yields big numerical error

Posted by 北城余情 on 2019-12-10 21:35:32

Question


Good morning,

I'm reading two numbers from a FITS file (representing the integer and fractional parts of a single number), converting them to long doubles (128-bit on my machine), and then summing them.

The result is not as precise as I would expect from using 128-bit floats. Here is the code:

a_int = np.longdouble(read_header_key(fits_file, 'I'))
print "I %.25f" % a_int, type(a_int)
a_float = np.longdouble(read_header_key(fits_file, 'F'))
print "F %.25f" % a_float, a_float.dtype
a = a_int + a_float
print "TOT %.25f" % a, a.dtype

and here's the answer I get:

I 55197.0000000000000000000000000 <type 'numpy.float128'>
F 0.0007660185200000000195833 float128
TOT 55197.0007660185219720005989075 float128

The result departs from what I would expect (55197.0007660185200000000195833) after just 11 decimal digits (16 significant digits in total). I would expect much better precision from 128-bit floats. What am I doing wrong?

This result was reproduced on a Mac and on a 32-bit Linux machine (in that case the dtype was float96, but the values were exactly the same).

Thanks in advance for your help!

Matteo


Answer 1:


The problem lies in how you print the np.longdouble. When you format with %f, Python converts the value to a float (64-bit) before printing, so the extra precision is lost in the output, not in the sum.

Here:

>>> a_int = np.longdouble(55197)
>>> a_float = np.longdouble(76601852) / 10**11
>>> b = a_int + a_float
>>> '%.25f' % b
'55197.0007660185219720005989075'
>>> '%.25f' % float(b)
'55197.0007660185219720005989075'
>>> b * 10**18
5.5197000766018519998e+22
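
The same check can be written for Python 3, together with one way to see the digits the long double actually holds. This is only a sketch: it assumes a NumPy version that provides np.format_float_positional, and the digits it prints depend on how wide np.longdouble is on your platform.

```python
import numpy as np

a_int = np.longdouble(55197)
a_frac = np.longdouble(76601852) / 10**11   # division done in long double
b = a_int + a_frac

# %-formatting converts b to a Python float first, so these two strings match.
via_percent = "%.25f" % b
via_float = "%.25f" % float(b)
print(via_percent == via_float)   # True

# NumPy's own formatter works on the long double directly.
# unique=False with precision=25 asks for 25 digits after the decimal point.
full = np.format_float_positional(b, unique=False, precision=25)
print(full)
```

If np.longdouble is no wider than a double on your machine, the last line prints the same digits as the %f version.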

Note that on my machine I only get a bit more precision with longdouble than with an ordinary double (20 decimal places instead of 15). So it may be worth checking whether the decimal module is better suited to your application. Decimal stores decimal values such as 0.00076601852 exactly and lets you choose the working precision for arithmetic, so a sum like this one carries no rounding error.
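
A minimal sketch of the Decimal route, in Python 3. It assumes the two header values are available as text; the variable names and string values below are made up for illustration, since the question does not show what read_header_key returns.

```python
from decimal import Decimal, getcontext

# Hypothetical inputs: the integer and fractional parts as strings.
int_part_text = "55197"
frac_part_text = "0.00076601852"

getcontext().prec = 40   # significant digits kept by arithmetic results

total = Decimal(int_part_text) + Decimal(frac_part_text)
print(total)             # 55197.00076601852

# Building a Decimal from a float keeps the float's binary rounding error,
# so it is not equal to the Decimal built from the text.
from_float = Decimal(0.00076601852)
print(from_float == Decimal(frac_part_text))   # False
```

The point of passing strings is that the value never goes through a 64-bit float on its way into Decimal.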




Answer 2:


My guess is that the %f modifier constructs a float from your longdouble object and uses that when creating the format string.

>>> import numpy as np
>>> np.longdouble(55197)
55197.0
>>> a = np.longdouble(55197)
>>> b = np.longdouble(0.0007660185200000000195833)
>>> a
55197.0
>>> b
0.00076601852000000001958
>>> a + b
55197.00076601852
>>> type(a+b)
<type 'numpy.float128'>
>>> a + b == 55197.00076601852
False

As a side note, even repr doesn't print enough digits to reconstruct the object. This is simply because no float literal carries enough precision to pass to your longdouble: Python parses the literal as a 64-bit float before longdouble ever sees it.
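
A small Python 3 sketch of that last point, using 0.1 as a stand-in value. It also prints how many mantissa bits each type has via np.finfo, because the result of the final comparison depends on whether np.longdouble is really wider than a double on your platform.

```python
import numpy as np

ld, d = np.finfo(np.longdouble), np.finfo(np.float64)
print("float64   :", d.nmant, "mantissa bits, about", d.precision, "decimal digits")
print("longdouble:", ld.nmant, "mantissa bits, about", ld.precision, "decimal digits")

from_literal = np.longdouble(0.1)   # 0.1 is already rounded to a 64-bit double
computed = np.longdouble(1) / 10    # division carried out in long double

# Converting back to float recovers exactly the double that was passed in.
print(float(from_literal) == 0.1)   # True

# Where longdouble is wider than double, these two values differ.
print(from_literal == computed)
```

So to get a full-precision long double, compute it from exactly representable pieces (as Answer 1 does with 76601852 / 10**11) rather than passing in a float literal.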



Source: https://stackoverflow.com/questions/14876165/sum-of-two-np-longdoubles-yields-big-numerical-error
