Accuracy of floating point arithmetic

Submitted by 旧巷老猫 on 2019-11-29 13:30:54

In a nutshell

You say that your compiler is Visual C++ 2010 Express. I do not have access to this compiler, but I understand that it generates programs that initially configure the x87 FPU to use 53 bits of precision, in order to emulate IEEE 754 double-precision computations as closely as possible.

Unfortunately, “as closely as possible” is not always close enough. The historical 80-bit floating-point registers can have their significand limited in width for the purpose of emulating double-precision, but they always retain the full range of their 15-bit exponent, which is wider than the 11-bit exponent of a double. The difference shows in particular when manipulating denormals (like your y): a value that would be denormal as a double is still a normal number inside the register.

What happens

My explanation would be that in printf("%23.16e\n", 1.6*y);, 1.6*y is computed as an 80-bit reduced-significand full-exponent number (it is thus a normal number), then converted to IEEE 754 double-precision (resulting in a denormal), then printed.

On the other hand, in printf("%23.16e\n", x + 1.6*y);, x + 1.6*y is computed with all 80-bit reduced-significand full-exponent numbers (again all intermediate results are normal numbers), then converted to IEEE 754 double-precision, then printed.

This would explain why 1.6*y prints the same as 2.0*y but has a different effect when added to x. The number that is printed is a double-precision denormal. The number that is added to x is an 80-bit reduced-significand full-exponent normal number (not the same one).
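The effect can be imitated on a machine that does strict double-precision arithmetic. The following is only a sketch under stated assumptions, not what your compiler actually emits: the values of x and y and all function names are hypothetical, and the wide exponent range is emulated by scaling the operands by 2^100 so that no intermediate result is denormal.

```c
#include <stdio.h>

/* Strict IEEE 754 double evaluation: the product is rounded to a double
   (a denormal for tiny y) before the addition. The volatile store forces
   that rounding and keeps the compiler from fusing the two operations. */
double add_scaled_strict(double x, double y)
{
    volatile double product = 1.6 * y;
    return x + product;
}

/* Emulation of a 53-bit significand with a wider exponent range:
   scaling by 2^100 is exact and keeps every intermediate result normal,
   so 1.6*y is NOT rounded to a denormal before it is added to x. */
double add_scaled_wide_exponent(double x, double y)
{
    const double scale = 0x1p100;            /* 2^100, C99 hex literal */
    double sum = x * scale + 1.6 * (y * scale);
    return sum / scale;                      /* exact when the result is a normal double */
}

/* Hypothetical inputs: y is the smallest positive denormal and x is a
   small normal number with an odd significand. */
void print_wide_exponent_demo(void)
{
    double x = 0x1.0000000000001p-1020;
    double y = 0x1p-1074;

    printf("%23.16e\n", 1.6 * y);
    printf("%23.16e\n", 2.0 * y);
    printf("%a\n", add_scaled_strict(x, y));
    printf("%a\n", add_scaled_wide_exponent(x, y));
}
```

Calling print_wide_exponent_demo() from main on a build with strict double evaluation should print the same value for 1.6*y and 2.0*y, while the two sums differ: the strict sum lands one unit in the last place above x, and the emulated wide-exponent sum is x itself.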

What happens with other compilers when generating x87 instructions

Other compilers, like GCC, do not configure the x87 FPU to manipulate 53-bit significands. This can have the same consequences (in this case x + 1.6*y would be computed with all 80-bit full-significand full-exponent numbers, and then converted to double-precision for printing or storing in memory). In this case, the issue is noticeable even more often: you do not need to involve denormals or infinities to notice differences, because every intermediate result carries 64 bits of significand instead of 53.

The article “The pitfalls of verifying floating-point computations” by David Monniaux contains all the details you may wish for and more.

Removing the unwanted behavior

To get rid of the problem (if you consider it to be one), find the flag that tells your compiler to generate SSE2 instructions for floating-point (/arch:SSE2 for Visual C++, -msse2 -mfpmath=sse for GCC). These implement exactly IEEE 754 semantics for single- and double-precision.
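One way to check which evaluation mode a build uses is the C99 macro FLT_EVAL_METHOD from <float.h>. The sketch below assumes a compiler that defines this macro (an older Visual C++ may not), and the helper names are hypothetical. Note also that the macro describes what the compiler intends; it may not reveal an x87 FPU that was merely reconfigured to a 53-bit significand.

```c
#include <float.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 when the given FLT_EVAL_METHOD value
   means operations are evaluated in the range and precision of their
   own type (value 0), and 0 for any other value. */
int evaluates_in_declared_type(int eval_method)
{
    return eval_method == 0;
}

/* Prints the evaluation method this translation unit was compiled with. */
void report_eval_method(void)
{
#ifdef FLT_EVAL_METHOD
    int method = (int)FLT_EVAL_METHOD;
    printf("FLT_EVAL_METHOD = %d\n", method);
    if (evaluates_in_declared_type(method))
        puts("double expressions are evaluated as double");
    else
        puts("intermediate results may carry extra range or precision");
#else
    puts("FLT_EVAL_METHOD is not defined by this compiler");
#endif
}
```

With GCC, a 32-bit x87 build typically reports 2 and a build using SSE2 for floating-point reports 0, so calling report_eval_method() before and after changing the flag is a quick sanity check.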
