Is there any scenario where function fma in libc can be used?

I came across this page and found an odd floating-point multiply-add function, fma (and fmaf). It says that the result is something like:

 (x * y) + z             #fma(x,y,z)

The value is computed as if to infinite precision and rounded once to the result format.

However, AFAICT I've never seen such a ternary operation before, so I'm wondering what the customary usage of this function is.

The important aspect of the fused multiply-add instruction is the (virtually) infinite precision of the intermediate result. This helps with performance, but not so much because two operations are encoded in a single instruction. It helps because the virtually infinite precision of the intermediate result is sometimes important, and very expensive to recover with ordinary multiplication and addition when this level of precision is really what the programmer is after.

Example: comparing `a * b` to `1.0`

Suppose that it is crucial to an algorithm to determine where the product of two double-precision numbers a and b lies with respect to a nonzero constant (we'll use 1.0). The numbers a and b both have full significands of binary digits. If you compute a*b as a double, the result may be 1.0, but that does not tell you whether the actual mathematical product was slightly below 1.0 and rounded up to exactly 1.0, or slightly above 1.0 and rounded down. Without FMA, your options are:

Compute a*b as a quad-precision number. Quad-precision is generally not implemented in hardware, but there are software emulation libraries. In quad-precision, the mathematical result of the product is exactly representable and you can then compare it to 1.0.
Compute a*b in double precision in round-upward mode and in round-downward mode. If both results are 1.0, it means a*b is exactly 1.0. If RU(a * b) is greater than 1.0, it means the mathematical product is higher than 1.0, and if RD(a * b) is below 1.0, that means the mathematical product is lower than 1.0. On most processors, this approach means changing the rounding mode three times, and each change is expensive (it involves flushing the CPU pipeline).

With an FMA instruction, one can compute fma(a, b, -1.0) and compare the result to 0.0. Since floating-point numbers are denser around zero, and since the intermediate product is not rounded in the computation, we can be certain that fma(a, b, -1.0) > 0 means the mathematical product of a and b is greater than 1, and so on.

Example: Veltkamp/Dekker multiplication

The double-double format is an efficient representation of numbers as the sum of two double-precision floating-point numbers. It is nearly as precise as quad-precision but takes advantage of existing double-precision hardware.

Consider the following function, Mul12(a, b), that takes two double-precision numbers a and b and computes their product as a double-double number. An algorithm, due to Veltkamp and Dekker, computes this function with only double-precision addition and multiplication (reference). It takes 6 multiplications (one is part of each Split() plus four in the main body of the algorithm), and plenty of additions.
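For reference, here is a sketch of the FMA-free version, following the classic Dekker formulation with Veltkamp's splitting as I understand it. The names split and mul12_dekker and the output-pointer interface are choices made for this example. It assumes IEEE 754 binary64 doubles, round-to-nearest, no overflow or underflow in the intermediate steps, and a compiler that does not fuse the multiplications and additions behind your back (with GCC or Clang, -ffp-contract=off is one way to ensure that).

```c
/* Veltkamp split: a == *hi + *lo, where each half fits in about half
   of the 53-bit significand. May overflow for very large |a|. */
static void split(double a, double *hi, double *lo)
{
    const double C = 134217729.0;   /* 2^27 + 1 */
    double p = C * a;               /* the one multiplication per split */
    *hi = (a - p) + p;
    *lo = a - *hi;
}

/* Dekker's Mul12: *high + *low represents the product a * b
   under the assumptions listed above. */
void mul12_dekker(double a, double b, double *high, double *low)
{
    double ah, al, bh, bl;
    split(a, &ah, &al);
    split(b, &bh, &bl);

    /* Four multiplications in the main body. */
    double p = ah * bh;
    double q = ah * bl + al * bh;
    *high = p + q;
    *low = ((p - *high) + q) + al * bl;
}
```

Counting the multiplications gives one per split() call plus four in the body, six in total, alongside a dozen or so additions and subtractions.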

If an FMA instruction is available, Mul12 can be implemented as two operations, one multiplication and one FMA.

high = a * b; /* double-precision approximation of the real product */
low = fma(a, b, -high); /* remainder of the real product */
/* now the real product of a and b is available as the sum of high and low */

More examples

Examples where FMA is used for its precision, and not only as an instruction that does a multiplication and an addition, are the computation of square root and division. These operations have to be correctly rounded (to the nearest floating-point number of the mathematical result) according to the IEEE 754 standard. These two operations can be implemented efficiently when a hardware FMA instruction is available. This aspect is typically hidden by the compilation chain, but the IA-64 instruction set (Itanium) did not have an instruction for division. Instead, the correctly rounded division could be obtained by a sequence of instructions (typically generated by the compiler) involving FMA.

It's usually used as an optimization. Many modern floating-point units have an fma instruction, so the calculation can be performed in a single instruction instead of two or more. So for performance-critical floating-point code, it's a helpful function to have.

Source: https://stackoverflow.com/questions/13292013/is-there-any-scenario-where-function-fma-in-libc-can-be-used
