Can Cortex-A57 dual-issue 128-bit NEON instructions?

Submitted by 纵饮孤独 on 2021-01-27 13:14:35

Question


The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).

However, in our internal (synthetic) benchmarks, throughput seems to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to ensure this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (NEON integer arithmetic only, no loads/stores).
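The benchmark itself is not shown in the question. As a rough illustration of what "plenty of instruction-level parallelism" can mean, the following Python sketch generates an A32 NEON instruction stream in which consecutive instructions write to different registers. The function name `make_stream`, the parameters `wide_every` and `chains`, the choice of `vadd.i32`, and the register allocation are all assumptions made for this example, not details from the original benchmark.

```python
def make_stream(n_insns, wide_every=2, chains=6):
    """Return a list of NEON integer adds on independent registers.

    Every `wide_every`-th instruction is 128-bit (q register); the others
    are 64-bit (d register). wide_every=1 gives an all-128-bit stream.
    `chains` is the number of registers cycled through, so each register
    is reused only every `chains` instructions (hypothetical parameter).
    """
    # q0..q7 alias d0..d15, so chains <= 8 keeps them clear of d16 and up.
    if not 1 <= chains <= 8:
        raise ValueError("chains must be between 1 and 8")
    lines = []
    for i in range(n_insns):
        chain = i % chains
        if i % wide_every == 0:
            # 128-bit add; q15 holds a constant operand
            lines.append(f"vadd.i32 q{chain}, q{chain}, q15")
        else:
            # 64-bit add on d16..d23, which do not overlap q0..q7;
            # d30 is the low half of q15
            lines.append(f"vadd.i32 d{16 + chain}, d{16 + chain}, d30")
    return lines


if __name__ == "__main__":
    # 50% 128-bit / 50% 64-bit mix, as in the second measurement
    print("\n".join(make_stream(12)))
```

The generated lines would be pasted into an unrolled loop body and timed with a cycle counter; `wide_every=1` corresponds to the pure 128-bit case and `wide_every=2` to the 50/50 mix.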

Are there special measures which have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/Neon instructions?

Thx, Clemens


Answer 1:


In real code, not all instruction results are written to the register file; some pass through forwarding paths instead. If you mix dependent and independent instructions in your code, you may see higher IPC.

The A57 optimisation guide states that late forwarding occurs for chains of multiply-accumulate instructions, so something like this may dual-issue.

.loop:
    vmla.s16 q0,q0,q1
    vmla.s16 q0,q0,q2
    vmla.s16 q0,q0,q3
    vmla.s16 q4,q4,q1
    vmla.s16 q4,q4,q2
    vmla.s16 q4,q4,q3
    ...etc



Answer 2:


According to ARM support, the reason seems to be that the NEON register file only supports 3x 64-bit write ports.

So although the NEON ALUs are capable of processing 2x 128-bit vectors, the register file is not capable of consuming the results... what a (very) strange design decision.
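The write-port argument can be checked with a little arithmetic. The sketch below is a toy model, not a description of the real pipeline: it assumes two issue pipes (F0/F1), three 64-bit write ports, that a 128-bit result occupies two ports and a 64-bit result one, and that independent instructions issue greedily in program order. The name `ipc_bound` and its parameters are made up for this example.

```python
def ipc_bound(widths, pipes=2, write_ports=3):
    """Instructions per cycle for a stream of independent instructions.

    `widths` lists result widths in bits (64 or 128). Assumption: each
    64 bits of result needs one register-file write port in the cycle
    the instruction issues.
    """
    if not widths:
        raise ValueError("empty instruction stream")
    if any(w // 64 > write_ports for w in widths):
        raise ValueError("an instruction needs more ports than exist")
    cycles = 0
    i = 0
    while i < len(widths):
        slots, ports = pipes, write_ports
        # Issue in order until a pipe or write-port limit is hit.
        while i < len(widths) and slots > 0 and widths[i] // 64 <= ports:
            ports -= widths[i] // 64
            slots -= 1
            i += 1
        cycles += 1
    return len(widths) / cycles


if __name__ == "__main__":
    print(ipc_bound([128] * 100))      # pure 128-bit stream
    print(ipc_bound([64] * 100))       # pure 64-bit stream
    print(ipc_bound([128, 64] * 50))   # 50/50 mix
```

Under these assumptions the model gives 1.0 for the pure 128-bit stream, which matches the measurement in the question, and 2.0 for both the pure 64-bit stream and the 50/50 mix. The latter is well above the 1.25 reported in the question, so the port count alone looks like a partial explanation at best.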



Source: https://stackoverflow.com/questions/34037900/can-cortex-a57-dual-issue-128-bit-neon-instructions
