simple floating-point numbers lose precision

雨燕双飞 submitted on 2019-12-04 07:32:42

Required reading: What Every Computer Scientist Should Know About Floating-Point Arithmetic, David Goldberg.

The issue is not one of precision. Rather, the issue is one of representability. First of all, let us recap that floating point numbers are used to represent real numbers. There are infinitely many real numbers. Of course, the same can be said of integers. The difference is that within any bounded range there are finitely many integers but infinitely many real numbers. Indeed, as Cantor originally showed, any finite interval of real numbers contains uncountably many values.

So it is clear that we cannot represent all real numbers on a finite machine. Which numbers can we represent? That depends on the data type. Delphi floating point data types use binary representation. The single (32 bit) and double (64 bit) types adhere to the IEEE-754 standard. The extended (80 bit) type is an Intel-specific type. In binary floating point, a representable number has the form k × 2^n, where k and n are integers. Note that I am not claiming that all numbers of this form are representable. That is not possible, because there are infinitely many such numbers. Rather, my point is that all representable numbers are of this form.
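To make the k × 2^n form concrete, here is a minimal Python sketch (Python rather than Delphi, purely for illustration). It assumes Python's float is an IEEE-754 double, which is the case on common platforms. The helper name as_k_times_2_pow_n is made up for this example.

```python
from fractions import Fraction


def as_k_times_2_pow_n(x: float):
    """Return integers (k, n) such that x == k * 2**n exactly (k odd, or 0)."""
    # as_integer_ratio() gives the exact value of the float as num/den,
    # where den is always a power of two.
    num, den = x.as_integer_ratio()
    n = -(den.bit_length() - 1)  # den == 2**(-n)
    # Strip remaining factors of two so that k is odd (only needed when den == 1).
    while num != 0 and num % 2 == 0:
        num //= 2
        n += 1
    return num, n


if __name__ == "__main__":
    for value in (0.375, 1.25, 6.0, 3.7):
        k, n = as_k_times_2_pow_n(value)
        print(f"{value!r} is stored as {k} * 2**{n}")
    # The stored value for the literal 3.7 is of the form k * 2**n,
    # but it is not the real number 37/10.
    print(Fraction(3.7) == Fraction(37, 10))
```

Every float decomposes this way, including the one stored for the literal 3.7. What the last line shows is that the stored number is not 3.7 itself.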

Some examples of representable binary floating point numbers include: 1, 0.5, 0.25, 0.75, 1.25, 0.125, 0.375. Your value, 3.7, is not representable as a binary floating point value.
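A quick way to check the examples above is to compare the exact value a decimal string denotes with the exact value of the float it parses to. This is a sketch in Python, assuming its float is an IEEE-754 double. The function name is_exact_double is hypothetical.

```python
from fractions import Fraction


def is_exact_double(text: str) -> bool:
    """True if the decimal number written in `text` is exactly representable as a double."""
    # Fraction(text) is the exact rational the text denotes;
    # Fraction(float(text)) is the exact value of the nearest double.
    return Fraction(float(text)) == Fraction(text)


if __name__ == "__main__":
    for text in ("1", "0.5", "0.25", "0.75", "1.25", "0.125", "0.375", "3.7"):
        print(text, is_exact_double(text))
```

Only the last entry, 3.7, fails the check.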

What this means in relation to your code is that none of it is doing what you expect it to do. You are hoping to compare against the value 3.7. But instead you are comparing against the nearest exactly representable value to 3.7. As a matter of implementation detail, this nearest exactly representable value is taken in the context of extended precision, which is why it appears that the version using extended does what you expect. However, do not take this to mean that your variable x is equal to 3.7. In fact it is equal to the nearest representable extended precision value to 3.7.

Rob Kennedy's most useful website can show you the closest representable values to a specific number. In the case of 3.7 these are:

3.7 = + 3.70000 00000 00000 00004 33680 86899 42017 73602 98112 03479 76684 57031 25
3.7 = + 3.70000 00000 00000 17763 56839 40025 04646 77810 66894 53125
3.7 = + 3.70000 00476 83715 82031 25

These are presented in the order extended, double, single. In other words these are the values of your variables x, d and s respectively.

If you look at these values and compare the single and double values with the closest extended value to 3.7, you will see why your program produces the output that it does. Both the single and double precision values here are greater than the extended value, which is what your program told you.
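The three values can be recomputed with exact rational arithmetic. The sketch below is Python, not Delphi, and makes some simplifying assumptions: it only handles positive inputs, ignores exponent range limits and subnormals, and models extended, double and single purely by their significand widths (64, 53 and 24 bits). The names nearest_binary and exact_decimal are invented for this example.

```python
from fractions import Fraction


def nearest_binary(x: Fraction, significand_bits: int) -> Fraction:
    """Round positive x to the nearest value with the given significand width."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    # Find e with 2**e <= x < 2**(e + 1).
    e = 0
    while Fraction(2) ** (e + 1) <= x:
        e += 1
    while Fraction(2) ** e > x:
        e -= 1
    # Representable values in this binade are multiples of 1/scale.
    scale = Fraction(2) ** (significand_bits - 1 - e)
    return Fraction(round(x * scale)) / scale  # round() on a Fraction rounds ties to even


def exact_decimal(q: Fraction) -> str:
    """Exact decimal expansion of a rational whose denominator is a power of two."""
    m = q.denominator.bit_length() - 1
    if q.denominator != 1 << m:
        raise ValueError("denominator is not a power of two")
    if m == 0:
        return str(q.numerator)
    # num / 2**m == num * 5**m / 10**m, so the digits are those of num * 5**m.
    digits = str(abs(q.numerator) * 5 ** m).rjust(m + 1, "0")
    sign = "-" if q < 0 else ""
    return f"{sign}{digits[:-m]}.{digits[-m:]}"


if __name__ == "__main__":
    target = Fraction(37, 10)
    for name, bits in (("extended", 64), ("double", 53), ("single", 24)):
        print(f"{name:>8}: {exact_decimal(nearest_binary(target, bits))}")
```

All three results lie slightly above 3.7, and they are ordered extended < double < single, which is the ordering the comparison in the program observes.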

I don't want to make any blanket recommendations as to how to compare floating point values. The best way to do that always depends very critically on the specific problem. No blanket advice can be usefully given.

Short answer: 0.7 cannot be represented exactly (binary floating point values are always fractions whose denominator is a power of 2). The precision of the data type you're storing it in, and of the type the compiler selects for the constant you're comparing it to, can affect the representation of that number and therefore the result of the comparison.

Moral: Never directly compare two floating point values for equality unless they're exactly the same data type and assigned the same exact value.

Obligatory link: What Every Computer Scientist Should Know About Floating-Point Arithmetic

Another link that might be helpful is to Delphi's Math.SameValue function, which allows you to compare two floating point values for approximate equality within a specified allowable delta (difference).
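For readers who want to see the idea behind such a tolerance-based comparison, here is a rough Python analogue. It is not Delphi's SameValue. It uses the standard library's math.isclose, and the helper to_single (a made-up name) rounds a double to single precision through struct. The tolerance 1e-6 is an arbitrary choice for this example, not a recommendation.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value, returned as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


d = 3.7             # nearest double to 3.7
s = to_single(3.7)  # nearest single to 3.7, widened back to double

exact_equal = (d == s)                            # different precisions, different values
close_enough = math.isclose(d, s, rel_tol=1e-6)   # equal within a relative tolerance
same_type_equal = (to_single(3.7) == to_single(3.7))  # same type, same assigned value

if __name__ == "__main__":
    print(exact_equal, close_enough, same_type_equal)
```

The right tolerance depends on the problem at hand, so treat the one used here as a placeholder.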
