Why adding from biggest to smallest floating-point numbers is less accurate than adding from smallest to biggest?

余生长醉 提交于 2019-12-11 17:37:31

问题


My Java textbook states that adding from biggest to smallest is less accurate than adding from smallest to biggest when dealing with floating-point numbers. However, he doesn't go on to clearly explain why this is the case.


回答1:


Floating point has a limited number of digits of precision (6 for float, 15 for double). The calculation

1.0e20d + 1 

gives the result 1.0e20 because there is not enough precision to represent the number

100,000,000,000,000,000,001

If you start with the largest number then any numbers more than n orders of magnitude smaller (where n is 6 or 15 depending on type) will not contribute to the sum at all. Start with the smallest and you might sum several smaller numbers into one that will affect the final total.

Where it would make a difference is, for example

1.0e20 + 1.0e4 + 6.0e4 + 3.0e4

Assuming it's exactly 15 decimal digits precision (it's not, see the linked article below, but 15 is good enough for the example), if you start with the larger number, none of the others will make a difference because they're too small. If you start with the smaller ones, they add up to 1.0e5, which IS large enough to affect the final total.

Please read What Every Computer Scientist Should Know About Floating-Point Arithmetic




回答2:


An excellent explanation is available in section 4.2 of "Accuracy and Stability of Numerical Algorithms" by Nick Higham. Below is my casual interpretation of this:

The key property of floating point is that when the result of an individual operation cannot be exactly represented, it is rounded to the nearest value. This has many consequences, namely that addition (and multiplication) is no longer associative.

The other main thing to note is that the error (the difference between the true value and the rounded value) is relative. If we use square brackets ([])to denote this rounding operation, then we have the property for any number x:

|[x] - x| <= ϵ |[x]| / 2

Where ϵ is the machine epsilon.

So suppose that we want to sum up [x1, x2, x3, x4]. The obvious way to do it is via

s2 = x1 + x2
s3 = s2 + x3 = x1 + x2 + x3
s4 = s3 + x4 = x1 + x2 + x3 + x4

As noted above, we can't do this exactly, so we're actually doing:

t2 = [x1 + x2]
t3 = [t2 + x3] = [[x1 + x2] + x3]
t4 = [t3 + x4] = [[[x1 + x2] + x3] +x4]

So how big is the resulting error |t4 - s4|? Well we know that

|t2 - s2| = |[x1+x2] - (x1+x2)| <= ϵ/2 |t2|

Now by the Triangle inequality we can write

|t3 - s3| =  |[t2+x3] - (t2+x3) + (t2+x3) - (s2+x3)| 
          <= |[t2+x3] - (t2+x3)| + |t2 - s2|
          <= ϵ/2 (|t3| + |t2|)

And again:

|t4 - s4| =  |[t3+x4] - (t3+x4) + (t3+x4) - (s3+x4)| 
          <= |[t3+x4] - (t3+x4)| + |t3 - s3|
          <= ϵ/2 (|t4| + |t3| + |t2|)

This leads to Higham's general advice:

In designing or choosing a summation method to achieve high accuracy, the aim should be to minimize the absolute values of the intermediate sums ti.

So if you're doing sequential summation (like we did above), then you want to start with the smallest elements, as that will give you the smallest intermediate sums.

But that is not the only option! There is also pairwise summation, where you add up pairs in a tree form (e.g. [[x1 + x2] + [x3 + x4]]), though this requires allocating a work array. You can also utilise SIMD vectorisation, by storing the intermediate sum in a vector, which can give both speed and accuracy improvements.



来源:https://stackoverflow.com/questions/48124692/why-adding-from-biggest-to-smallest-floating-point-numbers-is-less-accurate-than

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!