How to chain multiple fma operations together for performance?

蹲街弑〆低调 提交于 2019-12-13 12:27:44

问题


Assuming that in some C or C++ code I have a function named T fma( T a, T b, T c ) that performs 1 multiplication and 1 addition like so ( a * b ) + c ; how I'm supposed to optimize multiple mul & add steps ?

For example my algorithm needs to be implemented with 3 or 4 fma operations chained and summed together, How I can write this is an efficient way and at what part of the syntax or semantics I should dedicate particular attention ?

I also would like some hints on the critical part: avoid changing the rounding mode for the CPU to avoid flushing the cpu pipeline. But I'm quite sure that just using the + operation between multiple calls to fma shouldn't change that, I'm saying "quite sure" because I don't have too many CPUs to test this, I'm just following some logical steps.

My algorithm is something like the total of multiple fma calls

fma ( triplet 1 ) + fma ( triplet 2 ) + fma ( triplet 3 )

回答1:


Recently, in Build 2014 Eric Brumer gave a very nice talk on the topic (see here). The bottom line of talk was that

Using Fused Multiply Accumulate (aka FMA) everywhere hurts performance.

In Intel CPUs a FMA instruction costs 5 cycles. Instead doing a multiplication (5 cycles) and an addition (3 cycles) costs 8 cycles. Using FMA your are getting two operations in the prize of one (see picture below).

However, FMA seems not to be the holly grail of instructions. As you can see in the picture below FMA can in certain citations hurt the performance.

In the same fashion, your case fma(triplet1) + fma(triplet2) + fma(triplet 3) costs 21 cycles whereas if you were to do the same operations with out FMA would cost 30 cycles. That's a 30% gain in performance.

Using FMA in your code would demand using compiler intrinsics. In my humble opinion though, FMA etc. is not something you should be worried about, unless you are a C++ compiler programmer. If your are not, let the compiler optimization take care of these technicalities. Generally, under such kind of concerns lies the root of all evil (i.e., premature optimization), to paraphrase one of the great ones (i.e., Donald Knuth).



来源:https://stackoverflow.com/questions/23710356/how-to-chain-multiple-fma-operations-together-for-performance

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!