Question
Here is pseudocode of my problem.
I have an array of IEEE 754 double-precision positive numbers. The array can come in a random order, but the numbers are always the same, just scrambled in their positions. These numbers can also vary over a very wide range within the valid range of the double representation.
Once I have the list, I initialize a variable:
double sum_result = 0.0;
and I accumulate the sum in sum_result, in a loop over the whole array. At each step I do:
sum_result += my_double_array[i]
Is it guaranteed that, whatever the order of the initial array of double, if the numbers are the same, the printed-out sum result will always be the same?
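For reference, a minimal self-contained version of this accumulation loop in C might look like the following (the array contents and the print format are illustrative, not from the question):

#include <stddef.h>
#include <stdio.h>

int main(void) {
    // Illustrative data; in the real problem the values arrive in an arbitrary order.
    double my_double_array[] = { 1e-3, 2.5e10, 7.25, 3.0e-20 };
    size_t n = sizeof my_double_array / sizeof my_double_array[0];

    double sum_result = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum_result += my_double_array[i];  // one rounding per addition
    }
    printf("%.17g\n", sum_result);
    return 0;
}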
Answer 1:
Is it guaranteed that, whatever the order of the initial array of double, if the numbers are the same, the printed-out sum result will always be the same?
No, FP addition is not associative. Remember it is called floating point - the absolute precision "floats" about relative to 1.0. Any given operation like addition (+) is subject to round-off error.
Yet if the sum is computed and the inexact flag remains clear, then yes, the order was not relevant.**
A simple counterexample:
#include <fenv.h>
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

// Tell the compiler that the floating-point environment is accessed.
#pragma STDC FENV_ACCESS ON

int main(void) {
    double a[3] = { DBL_MAX, -DBL_MAX, 1.0 };

    // Left to right: DBL_MAX + -DBL_MAX is exactly 0.0, then + 1.0 is exact.
    feclearexcept(FE_ALL_EXCEPT);
    printf("%e\n", (a[0] + a[1]) + a[2]);
    printf("Inexact %d\n", !!fetestexcept(FE_INEXACT));

    // Right to left: -DBL_MAX + 1.0 rounds back to -DBL_MAX (inexact),
    // then DBL_MAX + -DBL_MAX gives 0.0.
    feclearexcept(FE_ALL_EXCEPT);
    printf("%e\n", a[0] + (a[1] + a[2]));
    printf("Inexact %d\n", !!fetestexcept(FE_INEXACT));

    printf("%d\n", FLT_EVAL_METHOD);
    return EXIT_SUCCESS;
}
Output
1.000000e+00 // Sum is exact
Inexact 0
0.000000e+00 // Sum is inexact
Inexact 1
0 // evaluate all operations ... just to the range and precision of the type
Depending on FLT_EVAL_METHOD, FP math may use wider precision and range, yet the above extreme example sums will still differ.
** aside from maybe a result of 0.0 vs -0.0
To see why, try a base-10 text example with 4 digits of precision. The same principle applies to double with its usual 53 binary digits of precision.
a[3] = +1.000e99, -1.000e99, 1.000
sum = a[0] + a[1] // sum now exactly 0.0
sum += a[2] // sum now exactly 1.0
// vs.
sum = a[1] + a[2] // sum now inexactly -1.000e99
sum += a[0] // sum now inexactly 0.0
Re: "printed out sum result will be always the same" : Unless code prints with "%a"
or "%.*e"
with higher enough precision, the text printed may lack significance and two different sums may look the same. See Printf width specifier to maintain precision of floating-point value
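For instance (a small illustration added here, not part of the original answer), two values that differ only in the last bit of the significand print identically with a plain "%e", but "%a" or "%.17e" distinguishes them:

#include <float.h>
#include <stdio.h>

int main(void) {
    double x = 1.0;
    double y = 1.0 + DBL_EPSILON;  // differs from x only in the last significand bit

    printf("%e vs %e\n", x, y);        // both print 1.000000e+00
    printf("%a vs %a\n", x, y);        // 0x1p+0 vs 0x1.0000000000001p+0
    printf("%.17e vs %.17e\n", x, y);  // enough digits to show the difference
    return 0;
}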
Answer 2:
No.
As a simple example, adding 1 to 0x1p53 yields 0x1p53. (This uses hexadecimal floating-point notation. The part before the “p” is the significand, expressed in hexadecimal the same as a C hexadecimal integer constant, except that it may have a “.” in it to mark the start of a fractional part. The number following the “p” represents a power of two by which the significand is multiplied.) This is because the mathematically exact sum, 0x1.00000000000008p+53, cannot be represented in IEEE-754 64-bit binary floating-point, so it is rounded to the nearest value with an even low bit in its significand, which is 0x1p53.
Thus, 0x1p53+1 yields 0x1p53. So 0x1p53+1+1, evaluated left to right, also yields 0x1p53. However, 1+1 is 2, and 2+0x1p53 is exactly representable, as 0x1.0000000000001p+53, so 1+1+0x1p53 is 0x1.0000000000001p+53.
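A quick way to check this (a small sketch added here, not part of the original answer) is to print both groupings with "%a":

#include <stdio.h>

int main(void) {
    // Left to right: 0x1p53 + 1 ties and rounds back to 0x1p53 (round-to-even),
    // so the second + 1 is lost as well.
    double left_to_right = 0x1p53 + 1.0 + 1.0;

    // Small terms first: 1 + 1 = 2, and 0x1p53 + 2 is exactly representable.
    double small_first = 1.0 + 1.0 + 0x1p53;

    printf("%a\n", left_to_right);  // 0x1p+53
    printf("%a\n", small_first);    // 0x1.0000000000001p+53
    return 0;
}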
To show a more easily visualizable example in decimal, suppose we have only two decimal digits. Then 100+1 yields 100, so 100+1+1+1+1+1+1 yields 100. But 1+1+1+1+1+1+100 accumulates to 6+100 which then yields 110 (due to rounding to two significant digits).
Answer 3:
Let's just take an example. I'm transposing the floating-point problem into a model in base 10 with only 2 significant digits to keep it simple, with each operation's result rounded to nearest.
Say we must sum the 3 numbers 9.9 + 8.4 + 1.4. The exact result is 19.7, but we have only two digits, so it should be rounded to 20.
If we first sum 9.9 + 8.4 we get 18.3, which is then rounded to 18. We then sum 18. + 1.4 and get 19.4, rounded to 19.
If we first sum the last two terms, 8.4 + 1.4, we get 9.8, with no rounding required yet. Then 9.9 + 9.8 gives 19.7, rounded to 20 - a different result.
(9.9 + 8.4) + 1.4 differs from 9.9 + (8.4 + 1.4): the sum operation is not associative, and this is due to intermediate rounding. We could exhibit similar examples with other rounding modes too...
The problem is exactly the same in base 2 with a 53-bit significand: intermediate rounding causes the non-associativity, whatever the base or significand length.
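The same effect is easy to reproduce with binary double (a small illustration added here, not from the original answer):

#include <stdio.h>

int main(void) {
    // None of 0.1, 0.2, 0.3 is exactly representable in binary,
    // so each addition rounds and the grouping changes the result.
    double left = (0.1 + 0.2) + 0.3;   // 0.60000000000000009...
    double right = 0.1 + (0.2 + 0.3);  // 0.59999999999999998... (nearest double to 0.6)

    printf("%.17g\n%.17g\n", left, right);
    printf("equal? %d\n", left == right);  // prints 0
    return 0;
}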
To eliminate the problem, you could either sort the numbers so that the order is always the same (see the sketch below), or eliminate the intermediate rounding and keep only the final one, for example with a super-accumulator like this: https://arxiv.org/pdf/1505.05571.pdf
...Or just accept living with an approximate result (up to you to analyze the average or worst-case error and decide whether it is acceptable...).
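As a rough sketch of the sorting approach (the helper names and test values here are illustrative, not from the answer):

#include <stdio.h>
#include <stdlib.h>

// Comparator for qsort: orders doubles ascending (sufficient for the finite,
// positive values described in the question).
static int cmp_double(const void *pa, const void *pb) {
    double a = *(const double *)pa, b = *(const double *)pb;
    return (a > b) - (a < b);
}

// Illustrative helper: sum a copy of the array in sorted order, so the same
// set of values always yields the same (still rounded, but reproducible) sum.
static double sorted_sum(const double *values, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    double sum = 0.0;
    if (tmp == NULL)
        return sum;  // allocation failure; real code should report an error
    for (size_t i = 0; i < n; i++)
        tmp[i] = values[i];
    qsort(tmp, n, sizeof *tmp, cmp_double);
    for (size_t i = 0; i < n; i++)
        sum += tmp[i];
    free(tmp);
    return sum;
}

int main(void) {
    double a[] = { 1e16, 1.0, 1.0, 1.0 };
    double b[] = { 1.0, 1.0, 1.0, 1e16 };  // same numbers, different order

    printf("%a vs %a\n", sorted_sum(a, 4), sorted_sum(b, 4));  // identical results
    return EXIT_SUCCESS;
}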
Source: https://stackoverflow.com/questions/47296419/result-of-the-sum-of-random-ordered-ieee-754-double-precision-floats