Why does the FMA _mm256_fmadd_pd() intrinsic have 3 asm mnemonics, “vfmadd132pd”, “231” and “213”?

一个人的身影 2021-02-19 13:30

Could someone explain to me why there are three variants of the fused multiply-add instruction: vfmadd132pd, vfmadd231pd and vfmadd213pd?

2 answers
  •  春和景丽
    2021-02-19 14:29

    The fused multiply-add instructions multiply two (packed) values, add a third value, and then overwrite one of the values with the result. Only one of the three values can be a memory operand rather than a register.

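    All three instructions overwrite the first operand (written here as ymm0) and allow only the third operand (ymm2) to be a memory operand. The choice of instruction determines which two operands are multiplied and which is added: the digits in the mnemonic are operand positions, where the first two digits name the operands that are multiplied and the last digit names the one that is added.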

    Assuming that ymm0 is the first operand in Intel syntax (or the last in AT&T syntax):

    vfmadd132pd:  ymm0 = ymm0 * ymm2/mem + ymm1
    vfmadd231pd:  ymm0 = ymm1 * ymm2/mem + ymm0
    vfmadd213pd:  ymm0 = ymm1 * ymm0 + ymm2/mem 
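
To make the three orderings concrete, here is a minimal scalar sketch in C. It is only a model: the function names are made up for illustration, plain doubles stand in for the ymm registers, and `a * b + c` is written as ordinary arithmetic rather than a true fused operation. `dst` plays the role of ymm0, `src2` of ymm1, and `src3` of ymm2/mem.

```c
/* Scalar model of the three FMA operand orderings (illustrative names only).
 * dst  ~ ymm0 (first operand, always overwritten)
 * src2 ~ ymm1 (second operand, register only)
 * src3 ~ ymm2/mem (third operand, may come from memory)
 * Note: plain a * b + c is used here, so this models the operand
 * selection only, not the single rounding of a real fused multiply-add. */

/* vfmadd132: multiply operands 1 and 3, add operand 2 */
void model_vfmadd132(double *dst, double src2, double src3)
{
    *dst = *dst * src3 + src2;
}

/* vfmadd231: multiply operands 2 and 3, add operand 1 */
void model_vfmadd231(double *dst, double src2, double src3)
{
    *dst = src2 * src3 + *dst;
}

/* vfmadd213: multiply operands 2 and 1, add operand 3 */
void model_vfmadd213(double *dst, double src2, double src3)
{
    *dst = src2 * *dst + src3;
}
```

In each case the value that is lost is whatever was in `dst` beforehand, so which form is cheapest depends on which of the three inputs is no longer needed and which one lives in memory.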
    

    When using the C intrinsics, this choice isn't necessary: The intrinsic does not overwrite a value but returns its result instead, and it allows all three values to be read from memory. The compiler will add memory reads/writes if needed, and will allocate a temporary register to store the result if it does not want any of the three values to be overwritten. It will choose one of the three instructions as it sees fit.
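
As a rough illustration of the intrinsic side, the sketch below computes `a[i] * b[i] + c[i]` for four doubles. The helper name `fma4` is hypothetical, and the `target` attribute is a GCC/Clang extension assumed here so that the file builds without `-mfma`; it still needs an x86 CPU with FMA support to run.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] for i = 0..3.
 * The target attribute (GCC/Clang extension) enables AVX and FMA
 * code generation for this one function. */
__attribute__((target("avx,fma")))
void fma4(const double *a, const double *b, const double *c, double *out)
{
    __m256d va = _mm256_loadu_pd(a);   /* unaligned loads of 4 doubles each */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);

    /* The intrinsic returns va * vb + vc and leaves its inputs intact;
     * the compiler decides which of the 132/213/231 forms to emit. */
    __m256d r = _mm256_fmadd_pd(va, vb, vc);

    _mm256_storeu_pd(out, r);
}
```

Inspecting the generated assembly (for example with `gcc -O2 -S`) should show one of the three mnemonics, and the choice may change depending on which inputs the compiler keeps in registers afterwards.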
