Given a vector of three (or four) floats. What is the fastest way to sum them?
Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add ins
const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
const __m128 t1 = _mm_movehl_ps(v, v);
const __m128 t2 = _mm_add_ps(v, t1);
const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));
I've found these to be about same speed as double HADDPS
(but I haven't measured too closely).