Regarding minimising the error in floating-point operations, if I have an operation such as the following in C:
float a = 123.456;
float b = 456.789;
float r
Note: this answer starts with a lengthy discussion of the distinction between `a = a - (r * b);` and `float c = r * b; a = a - c;` with a C99-compliant compiler. The part of the question about the goal of improving accuracy while avoiding extended precision is covered at the end.
If your C99 compiler defines `FLT_EVAL_METHOD` as 0, then the two computations can be expected to produce exactly the same result. If the compiler defines `FLT_EVAL_METHOD` as 1 or 2, then `a = a - (r * b);` will be more precise for some values of `a`, `r` and `b`, because all intermediate computations will be done at an extended precision (`double` for the value 1 and `long double` for the value 2).
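To make the distinction concrete, here is a minimal sketch of the two forms side by side, together with a helper that reports what the compiler claims for `FLT_EVAL_METHOD`. The function names are made up for illustration and are not from the question.

```c
#include <float.h>   /* FLT_EVAL_METHOD (C99) */

/* Single expression: the product r * b may be kept at the evaluation
   precision selected by FLT_EVAL_METHOD before the subtraction. */
float update_single_expression(float a, float r, float b)
{
    return a - (r * b);
}

/* Two statements: the assignment to c rounds the product to float
   before the subtraction takes place. */
float update_two_statements(float a, float r, float b)
{
    float c = r * b;
    return a - c;
}

/* Value of FLT_EVAL_METHOD for this compilation:
   0 = operations evaluated in their own type,
   1 = float and double operations evaluated as double,
   2 = all operations evaluated as long double,
   negative = indeterminate. */
int evaluation_method(void)
{
    return (int)FLT_EVAL_METHOD;
}
```

If `evaluation_method()` returns 0 and contraction is disabled, the two update functions should agree for all inputs. With 1 or 2, they may differ in the last bit for some inputs.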
The program cannot set `FLT_EVAL_METHOD`, but you can use command-line options to change the way your compiler computes with floating-point, and that will make it change its definition accordingly.
Depending on whether you use `#pragma STDC FP_CONTRACT` (spelled `#pragma fp_contract` by some compilers) in your program and on your compiler's default value for this pragma, some compound floating-point expressions can be contracted into single instructions that behave as if the intermediate result were computed with infinite precision. This happens to be a possibility for your example when targeting a modern processor, as the fused multiply-add (FMA) instruction will compute `a` directly and as accurately as allowed by the floating-point type.
However, you should bear in mind that the contraction only takes place at the compiler's option, without any guarantees. The compiler uses the FMA instruction to optimize speed, not accuracy, so the transformation may not take place at lower optimization levels. Sometimes several transformations are possible (e.g. `a * b + c * d` can be computed either as `fmaf(c, d, a*b)` or as `fmaf(a, b, c*d)`) and the compiler may choose one or the other.
In short, the contraction of floating-point computations is not intended to help you achieve accuracy. You might as well make sure it is disabled if you like reproducible results.
However, in the particular case of the fused multiply-add compound operation, you can use the C99 standard function `fmaf()` to tell the compiler to compute the multiplication and addition in a single step with a single rounding. If you do this, then the compiler will not be allowed to produce anything other than the best result for `a`.
float fmaf(float x, float y, float z);

DESCRIPTION
The fma() functions compute (x*y)+z, rounded as one ternary operation: they compute the value (as if) to infinite precision and round once to the result format, according to the current rounding mode.
Note that if the FMA instruction is not available, your compiler's implementation of the function `fmaf()` will at best just use higher precision, and if this happens on your compilation platform, you might just as well use the type `double` for the accumulator: it will be faster and more accurate than using `fmaf()`. In the worst case, a flawed implementation of `fmaf()` will be provided.
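A rough sketch of the `double` accumulator alternative follows, assuming the real computation subtracts many `r * b` products from `a`. The function name and the array-based signature are assumptions made for the example, not something stated in the question.

```c
#include <stddef.h>  /* size_t */

/* Subtracts every product r[i] * b[i] from a, accumulating in double.
   The product of two floats needs at most 48 significand bits, so it is
   exact in double (barring overflow or underflow); only the subtractions
   and the final conversion to float introduce rounding. */
float subtract_products_double(float a, const float *r, const float *b,
                               size_t n)
{
    double acc = a;
    for (size_t i = 0; i < n; i++) {
        acc -= (double)r[i] * (double)b[i];
    }
    return (float)acc;   /* single conversion back to float at the end */
}
```

Because each product is already exact in `double`, it makes no difference here whether the compiler contracts the loop body into a fused operation or not. The result is rounded once to `double` per step and once more to `float` at the end, so it can occasionally differ from a chain of `fmaf()` calls.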
Use Kahan summation if your computation involves a long chain of additions. Some accuracy can be gained by simply summing the `r*b` terms computed as single-precision products, assuming there are many of them. If you wish to gain more accuracy, you might want to compute `r*b` itself exactly as the sum of two single-precision numbers, but if you do this you might as well switch to double-single arithmetic entirely. Double-single arithmetic would be the same as the double-double technique succinctly described here, but with single-precision numbers instead.