Given a vector of three (or four) floats, what is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add ins
You can do it in two HADDPS instructions in SSE3:
v = _mm_hadd_ps(v, v); v = _mm_hadd_ps(v, v);
This puts the sum in all elements.
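For reference, here is a minimal sketch of how the two-HADDPS approach might look as complete functions. The names `hsum4_sse3` and `hsum3_sse3` and the `SUM_SSE3` macro are made up for illustration, and the `target("sse3")` attribute assumes GCC or Clang on x86 (with other compilers, or if you prefer, build with SSE3 enabled instead). For the three-float case it assumes you zero the fourth lane first so it does not contribute to the sum.

```c
#include <pmmintrin.h>  /* SSE3 intrinsics, including _mm_hadd_ps */

/* Assumption: GCC/Clang. Enables SSE3 per function so -msse3 is not required. */
#if defined(__GNUC__)
#define SUM_SSE3 __attribute__((target("sse3")))
#else
#define SUM_SSE3
#endif

/* Sum four floats with two horizontal adds. */
SUM_SSE3 float hsum4_sse3(const float *p)
{
    __m128 v = _mm_loadu_ps(p);   /* [a, b, c, d] (unaligned load) */
    v = _mm_hadd_ps(v, v);        /* [a+b, c+d, a+b, c+d] */
    v = _mm_hadd_ps(v, v);        /* (a+b)+(c+d) in every lane */
    return _mm_cvtss_f32(v);      /* extract the lowest lane */
}

/* Sum three floats: same idea, with the unused fourth lane set to zero. */
SUM_SSE3 float hsum3_sse3(const float *p)
{
    /* _mm_set_ps takes its arguments from the highest lane down to the lowest */
    __m128 v = _mm_set_ps(0.0f, p[2], p[1], p[0]);
    v = _mm_hadd_ps(v, v);        /* [a+b, c+0, a+b, c+0] */
    v = _mm_hadd_ps(v, v);        /* (a+b)+c in every lane */
    return _mm_cvtss_f32(v);
}
```

Note that the additions are paired as (a+b)+(c+d), so the result can differ in the last bits from a strict left-to-right scalar sum. Whether this beats a shuffle/add sequence depends on the CPU, so measure on your target.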