NEON float multiplication is slower than expected

Submitted by こ雲淡風輕ζ on 2019-12-04 08:11:56

Cortex-A8 and Cortex-A9 can do only two single-precision floating-point multiplications per cycle, so you can at most double the performance on those (most popular) CPUs. In practice, ARM CPUs have very low IPC, so it is preferable to unroll the loops as much as possible. If you want ultimate performance, write in assembly: gcc's code generator for ARM is nowhere near as good as the one for x86.
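As a rough illustration of the unrolling advice, here is a minimal sketch that handles eight floats per iteration with NEON intrinsics. The function name mul_tab_unrolled, the length parameter n, and the requirement that n be a multiple of 8 are assumptions made for this example and are not taken from the original code. A scalar fallback is included only so the sketch also builds on targets without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: c[i] = a[i] * b[i] for n floats.
 * Assumption for this sketch: n is a multiple of 8. */
void mul_tab_unrolled(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 8 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 8) {
        /* Issue all four loads first so the two multiplies
         * do not depend on each other. */
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        vst1q_f32(c + i,     vmulq_f32(a0, b0));
        vst1q_f32(c + i + 4, vmulq_f32(a1, b1));
    }
#else
    /* Scalar fallback for non-NEON targets. */
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
#endif
}
```

Unrolling further (16 or more floats per iteration) follows the same pattern; how much it helps depends on the core and the compiler, so it is worth measuring rather than assuming.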

I also recommend using CPU-specific optimization options: "-O3 -mcpu=cortex-a9 -march=armv7-a -mtune=cortex-a9 -mfpu=neon -mthumb" for Cortex-A9; for Cortex-A15, Cortex-A8 and Cortex-A5, change the -mcpu and -mtune values to cortex-a15, cortex-a8 or cortex-a5 accordingly. gcc does not have optimizations for Qualcomm CPUs, so for Qualcomm Scorpion use the Cortex-A8 parameters (and also unroll even more than you usually do), and for Qualcomm Krait try the Cortex-A15 parameters (you will need a recent version of gcc that supports them).

One shortcoming of NEON intrinsics is that you can't use auto-increment addressing on loads, which shows up as extra instructions in your NEON implementation.

Compiled with gcc version 4.4.3 and options -c -std=c99 -mfpu=neon -O3, then dumped with objdump, this is the loop part of mul_tab_neon:

000000a4 <mul_tab_neon>:
  ac:   e0805003    add r5, r0, r3
  b0:   e0814003    add r4, r1, r3
  b4:   e082c003    add ip, r2, r3
  b8:   e2833010    add r3, r3, #16
  bc:   f4650a8f    vld1.32 {d16-d17}, [r5]
  c0:   f4642a8f    vld1.32 {d18-d19}, [r4]
  c4:   e3530e19    cmp r3, #400    ; 0x190
  c8:   f3400df2    vmul.f32    q8, q8, q9
  cc:   f44c0a8f    vst1.32 {d16-d17}, [ip]
  d0:   1afffff5    bne ac <mul_tab_neon+0x8>

and this is the loop part of mul_tab_standard:

00000000 <mul_tab_standard>:
  58:   ecf01b02    vldmia  r0!, {d17}
  5c:   ecf10b02    vldmia  r1!, {d16}
  60:   f3410db0    vmul.f32    d16, d17, d16
  64:   ece20b02    vstmia  r2!, {d16}
  68:   e1520003    cmp r2, r3
  6c:   1afffff9    bne 58 <mul_tab_standard+0x58>

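As you can see, in the standard case the compiler creates a much tighter loop: six instructions per iteration with post-incremented loads and stores, versus ten in the NEON version, which recomputes all three addresses with separate add instructions.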
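One thing that may be worth trying is to write the intrinsics loop with pointers that are advanced explicitly, instead of indexing from a shared offset. The sketch below shows that form. The name mul_tab_neon_ptr and the length parameter n are assumptions for this example, not the original function, and whether a given compiler actually folds the pointer updates into post-incremented vld1/vst1 instructions is not guaranteed; check the objdump output for your own toolchain.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical variant: c[i] = a[i] * b[i] for n floats.
 * Assumption for this sketch: n is a multiple of 4. */
void mul_tab_neon_ptr(const float *a, const float *b, float *c, size_t n)
{
    assert(n % 4 == 0);
    float *end = c + n;   /* loop until the output pointer reaches the end */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    while (c != end) {
        float32x4_t va = vld1q_f32(a);
        float32x4_t vb = vld1q_f32(b);
        vst1q_f32(c, vmulq_f32(va, vb));
        /* Each pointer advances by one 4-float vector, giving the
         * compiler a chance to use post-increment addressing. */
        a += 4;
        b += 4;
        c += 4;
    }
#else
    /* Scalar fallback for non-NEON targets. */
    while (c != end)
        *c++ = *a++ * *b++;
#endif
}
```

Comparing on the output pointer mirrors the "cmp r2, r3" pattern in the mul_tab_standard listing, so the loop needs no separate counter.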
