How to chain multiple fma operations together for performance?

Submitted by 倖福魔咒の on 2019-12-05 12:54:34

Recently, at Build 2014, Eric Brumer gave a very nice talk on the topic (see here). The bottom line of the talk was that

Using Fused Multiply Accumulate (aka FMA) everywhere hurts performance.

On Intel CPUs an FMA instruction costs 5 cycles. Doing a multiplication (5 cycles) and an addition (3 cycles) separately costs 8 cycles instead. Using FMA, you are getting two operations for the price of one (see picture below).

However, FMA seems not to be the holy grail of instructions. As you can see in the picture below, FMA can in certain situations hurt performance.
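One situation of this kind is an expression like a*b + c*d. The sketch below is a rough latency model, not a measurement: it assumes the cycle counts quoted above and assumes that two independent multiplies can execute in parallel. The constant and function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Latencies quoted in this answer, in cycles. Real values depend on the CPU.
constexpr int kMulLatency = 5;
constexpr int kAddLatency = 3;
constexpr int kFmaLatency = 5;

// a*b + c*d without FMA: the two multiplies are independent and can overlap,
// so the critical path is one multiply followed by one add.
constexpr int latency_mul_mul_add() {
    return std::max(kMulLatency, kMulLatency) + kAddLatency;
}

// a*b + c*d as fma(a, b, c*d): the FMA cannot start until c*d is ready,
// so the critical path is one multiply followed by one FMA.
constexpr int latency_mul_then_fma() {
    return kMulLatency + kFmaLatency;
}
```

Under these assumptions the FMA version has the longer dependency chain (10 cycles against 8), even though it executes fewer instructions.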

In the same fashion, your case, fma(triplet1) + fma(triplet2) + fma(triplet3), costs 21 cycles, whereas the same operations without FMA would cost 30 cycles. That's a 30% gain in performance.
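The 21 and 30 cycle figures can be reproduced with a small back-of-the-envelope model. It assumes that each fma(triplet) stands for a*b + c, that the latencies are the ones quoted earlier in this answer, and that every operation is counted one after another with no overlap. The helper names are hypothetical.

```cpp
#include <cassert>

// Cycle counts quoted in this answer. Real values depend on the CPU.
constexpr int kMulCycles = 5;
constexpr int kAddCycles = 3;
constexpr int kFmaCycles = 5;

// n fused triplets, then (n - 1) additions to sum the partial results.
constexpr int cycles_with_fma(int n) {
    return n * kFmaCycles + (n - 1) * kAddCycles;
}

// n separate multiply + add pairs, then (n - 1) additions to sum them.
constexpr int cycles_without_fma(int n) {
    return n * (kMulCycles + kAddCycles) + (n - 1) * kAddCycles;
}

// Percentage of cycles saved by the fused version.
constexpr int percent_saved(int n) {
    return 100 * (cycles_without_fma(n) - cycles_with_fma(n)) / cycles_without_fma(n);
}
```

For three triplets this gives 3*5 + 2*3 = 21 cycles with FMA and 3*8 + 2*3 = 30 cycles without, which is a saving of 9 out of 30 cycles.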

Using FMA in your code would demand using compiler intrinsics. In my humble opinion, though, FMA and the like are not something you should worry about unless you are a C++ compiler programmer. If you are not, let the compiler's optimizer take care of these technicalities. Generally, under such concerns lies the root of all evil (i.e., premature optimization), to paraphrase one of the great ones (i.e., Donald Knuth).
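For completeness, here is a minimal sketch of what an explicit FMA call might look like. It assumes an x86 target and a compiler that defines __FMA__ when FMA code generation is enabled (GCC and Clang do this with -mfma). On any other setup it falls back to the standard std::fma. The wrapper name fused_mul_add is made up for this example.

```cpp
#include <cassert>
#include <cmath>
#if defined(__FMA__)
#include <immintrin.h>
#endif

// Computes a*b + c as one fused operation.
inline float fused_mul_add(float a, float b, float c) {
#if defined(__FMA__)
    // Scalar FMA3 intrinsic: operates on the lowest float lane of each vector.
    return _mm_cvtss_f32(
        _mm_fmadd_ss(_mm_set_ss(a), _mm_set_ss(b), _mm_set_ss(c)));
#else
    // Portable fallback from <cmath>. It may run in software on targets
    // without FMA hardware.
    return std::fma(a, b, c);
#endif
}
```

A call such as fused_mul_add(2.0f, 3.0f, 4.0f) returns 10.0f on either path.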
