Question
The Cortex-A57 Optimization Guide states that most integer instructions operating on 128-bit vector data can be dual-issued (Page 24, integer basic F0/F1, logical F0/F1, execution throughput 2).
However, in our internal (synthetic) benchmarks, throughput appears to be limited to exactly one 128-bit NEON integer instruction per clock, even when plenty of instruction-level parallelism is available (the benchmark was written specifically to test whether 128-bit NEON instructions can be dual-issued, so we took care to ensure this). When mixing 50% 128-bit with 50% 64-bit instructions, we were able to achieve 1.25 instructions per clock (NEON integer arithmetic only, no loads/stores).
Are there special measures that have to be taken in order to get dual-issue throughput when using 128-bit ASIMD/NEON instructions?
Thx, Clemens
Answer 1:
In real code, not all instruction results will be written to the register file; instead, some will pass through forwarding paths. If you mix dependent and independent instructions in your code, you may see higher IPC.
The A57 optimisation guide states that late-forwarding occurs for chains of multiply-accumulate instructions, so maybe something like this will dual-issue:
.loop
vmla.s16 q0,q0,q1
vmla.s16 q0,q0,q2
vmla.s16 q0,q0,q3
vmla.s16 q4,q4,q1
vmla.s16 q4,q4,q2
vmla.s16 q4,q4,q3
...etc
Answer 2:
According to ARM support, the reason seems to be that the NEON register file only has 3x 64-bit write ports.
So although the NEON ALUs are capable of processing 2x 128-bit vectors per cycle, the register file is not capable of consuming the results ... what a (very) strange design decision.
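The write-port arithmetic behind this explanation can be sketched as a toy model. This is only an illustration under stated assumptions, not a description of the real A57 scheduler: it assumes every 64 bits of result needs one write port, that there are 3 such ports and 2 issue pipes (F0/F1), and that instructions issue in order. The function name and parameters are made up for this example.

```python
def max_issue_per_cycle(result_widths_bits, write_ports=3, pipes=2, port_bits=64):
    """Count how many of the next instructions fit into one cycle.

    Toy model (assumption): an instruction issues only if a pipe is free
    and enough 64-bit write ports remain for its whole result.
    """
    issued = 0
    ports_used = 0
    for width in result_widths_bits:
        ports_needed = -(-width // port_bits)  # ceiling division
        if issued == pipes or ports_used + ports_needed > write_ports:
            break  # in-order: stop at the first instruction that does not fit
        issued += 1
        ports_used += ports_needed
    return issued


if __name__ == "__main__":
    print(max_issue_per_cycle([128, 128]))  # two 128-bit results need 4 ports
    print(max_issue_per_cycle([64, 64]))    # two 64-bit results need 2 ports
    print(max_issue_per_cycle([128, 64]))   # mixed pair needs 3 ports
```

Under these assumptions, two 128-bit results would need four 64-bit writes and so cannot retire in the same cycle, which matches the observed limit of one 128-bit instruction per clock. The model does not reproduce the 1.25 instructions per clock reported for the 50/50 mix, so it is at best a partial picture of the real constraints.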
Source: https://stackoverflow.com/questions/34037900/can-cortex-a57-dual-issue-128-bit-neon-instructions