Sum all elements in a quadword vector in ARM assembly with NEON

Submitted by 我只是一个虾纸丫 on 2019-12-05 09:25:09

It seems that you want to sum an array of some given length, not just four float values.

In that case, your code will work, but it is far from optimized:

  1. many pipeline interlocks

  2. an unnecessary 32-bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16:

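  vldmia pSrc!, {q0-q1}     @ load the first 8 floats into two accumulators
  sub count, count, #8
loop:
  pld [pSrc, #32]           @ prefetch ahead of the next load
  vldmia pSrc!, {q2-q3}     @ next 8 floats (q0-q3 need no saving under the AAPCS)
  subs count, count, #8
  vadd.f32 q0, q0, q2       @ two independent accumulators
  vadd.f32 q1, q1, q3
  bgt loop

  vadd.f32 q0, q0, q1       @ combine the two accumulators
  vpadd.f32 d0, d0, d1      @ d0 = {s0+s1, s2+s3}
  vadd.f32 s0, s0, s1       @ s0 = total

  • pld, although an ARM instruction rather than a NEON one, is crucial for performance: it can drastically increase the cache hit rate.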

I hope the rest of the code above is self-explanatory.

You will notice that this version is many times faster than your initial one.
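For readers who prefer C, the same loop structure can be sketched with NEON intrinsics. This is only an illustration, not the answer's code: `sum_f32` is a hypothetical name, `__builtin_prefetch` is a GCC/Clang builtin standing in for `pld`, and the non-NEON branch is a plain scalar model added so the sketch also builds on other targets. Unlike the assembly, this form only needs the length to be a multiple of 8 and at least 8.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* sum_f32 is a hypothetical helper, not part of the original answer.
 * Assumption: n is a multiple of 8 and n >= 8. */
float sum_f32(const float *src, size_t n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Two independent accumulators, like q0 and q1 in the assembly. */
    float32x4_t acc0 = vld1q_f32(src);
    float32x4_t acc1 = vld1q_f32(src + 4);
    for (size_t i = 8; i < n; i += 8) {
        /* GCC/Clang builtin; plays the role of pld (a hint only). */
        __builtin_prefetch(src + i + 8);
        acc0 = vaddq_f32(acc0, vld1q_f32(src + i));
        acc1 = vaddq_f32(acc1, vld1q_f32(src + i + 4));
    }
    /* vadd.f32 q0, q0, q1 */
    float32x4_t acc = vaddq_f32(acc0, acc1);
    /* vpadd.f32 d0, d0, d1 */
    float32x2_t pair = vpadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    /* vadd.f32 s0, s0, s1 */
    return vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
#else
    /* Scalar model of the same 8-lane accumulation (fallback only). */
    float lane[8];
    for (int k = 0; k < 8; k++)
        lane[k] = src[k];
    for (size_t i = 8; i < n; i += 8)
        for (int k = 0; k < 8; k++)
            lane[k] += src[i + k];
    float q[4];
    for (int k = 0; k < 4; k++)
        q[k] = lane[k] + lane[k + 4];
    return (q[0] + q[1]) + (q[2] + q[3]);
#endif
}
```

Because the lanes are added in a different order than a simple left-to-right loop, the result may differ from a scalar sum in the last bits for general floating-point input.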

You might try this (it's not in ASM, but you should be able to convert it easily):

float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);

In ASM it would probably be just a VADD and a VPADD.

I'm not sure whether this is the only way to do it (or the most optimal one), but I haven't figured out or found a better one.
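As a rough sketch, the two intrinsic lines can be wrapped in a complete function. The name `hsum4` and the choice to pass the four floats by pointer are assumptions made for illustration; the non-NEON branch is a scalar stand-in that adds the lanes in the same order, so the sketch compiles on other targets as well.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* hsum4 is a hypothetical name; v points at four floats. */
float hsum4(const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t m = vld1q_f32(v);
    /* r = {v[2] + v[0], v[3] + v[1]} */
    float32x2_t r = vadd_f32(vget_high_f32(m), vget_low_f32(m));
    /* Pairwise add puts r[0] + r[1] in lane 0. */
    return vget_lane_f32(vpadd_f32(r, r), 0);
#else
    /* Scalar stand-in with the same order of additions. */
    float r0 = v[2] + v[0];
    float r1 = v[3] + v[1];
    return r0 + r1;
#endif
}
```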

PS. I'm new to NEON too

Here is the code in ASM:

    vpadd.f32 d1,d6,d7    @ q3 (d6:d7) holds the values to sum; d1 = {s2, s3} gets the pairwise sums
    vadd.f32 s1,s2,s3     @ add the two halves of d1 together (the sum of q3)
    vadd.f32 s0,s0,s1     @ sum += s1;

I may have forgotten to mention that in C the code would look like this:

float sum = 1.0f;
sum += number1 * number2;

I have omitted the multiplication from this little piece of ASM code.
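One possible reading of that C snippet, with the multiplication put back, is sketched here. It assumes that `number1` and `number2` are four-lane float vectors, which the answer does not state. `accumulate_dot4` is a hypothetical name, and the non-NEON branch is a scalar stand-in for targets without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* accumulate_dot4 is a hypothetical helper.
 * Assumption: a and b each point at four floats. */
float accumulate_dot4(float sum, const float a[4], const float b[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* The multiplication that the assembly snippet leaves out. */
    float32x4_t prod = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    /* vpadd.f32 d1, d6, d7 */
    float32x2_t pair = vpadd_f32(vget_low_f32(prod), vget_high_f32(prod));
    /* vadd.f32 s1, s2, s3 */
    float s1 = vget_lane_f32(pair, 0) + vget_lane_f32(pair, 1);
#else
    /* Scalar stand-in with the same order of additions. */
    float p[4];
    for (int k = 0; k < 4; k++)
        p[k] = a[k] * b[k];
    float s1 = (p[0] + p[1]) + (p[2] + p[3]);
#endif
    /* vadd.f32 s0, s0, s1 */
    return sum + s1;
}
```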
