Im rather new to assembly and although the arm information center is often helpful sometimes the instructions can be a little confusing to a newbie. Basically what I need to do is sum 4 float values in a quadword register and store the result in a single precision register. I think the instruction VPADD can do what I need but I'm not quite sure.
It seems that you want to get the sum of a certain length of array, and not only four float values.
In that case, your code will work, but is far from optimized :
many many pipeline interlocks
unnecessary 32bit addition per iteration
Assuming the length of the array is a multiple of 8 and at least 16 :
vldmia {q0-q1}, [pSrc]!
sub count, count, #8
loop:
pld [pSrc, #32]
vldmia {q3-q4}, [pSrc]!
subs count, count, #8
vadd.f32 q0, q0, q3
vadd.f32 q1, q1, q4
bgt loop
vadd.f32 q0, q0, q1
vpadd.f32 d0, d0, d1
vadd.f32 s0, s0, s1
- pld - while being an ARM instruction and not NEON - is crucial for performance. It drastically increases cache hit rate.
I hope the rest of the code above is self explanatory.
You will notice that this version is many times faster than your initial one.
You might try this (it's not in ASM, but you should be able to convert it easily):
float32x2_t r = vadd_f32(vget_high_f32(m_type), vget_low_f32(m_type));
return vget_lane_f32(vpadd_f32(r, r), 0);
In ASM it would be probably only VADD and VPADD.
I'm not sure if this is only one method to do this (and most optimal), but I haven't figured/found better one...
PS. I'm new to NEON too
Here is the code in ASM:
vpadd.f32 d1,d6,d7 @ q3 is register that needs all of its contents summed
vadd.f32 s1,s2,s3 @ now we add the contents of d1 together (the sum)
vadd.f32 s0,s0,s1 @ sum += s1;
I may have forgotten to mention that in C the code would look like this:
float sum = 1.0f;
sum += number1 * number2;
I have omitted the multiplication from this little piece asm of code.
来源:https://stackoverflow.com/questions/6931217/sum-all-elements-in-a-quadword-vector-in-arm-assembly-with-neon