Fastest way to do horizontal SSE vector sum (or other reduction)

前端未结

Given a vector of three (or four) floats, what is the fastest way to sum them?

Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add ins

4 Answers

2020-11-21 07:58

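SSE2 (the intrinsics below need only SSE):

Sum of all four elements:

const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));        // [v0+v2, v1+v3, ...]
const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));  // lane 0 = (v0+v2) + (v1+v3)

Sum of three elements (as written, lane 0 ends up holding v1 + v2 + v3; v0 is ignored):

const __m128 t1 = _mm_movehl_ps(v, v);                        // [v2, v3, v2, v3]
const __m128 t2 = _mm_add_ps(v, t1);                          // [v0+v2, v1+v3, ...]
const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1)); // lane 0 = v2 + (v1+v3)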

I've found these to be about the same speed as two back-to-back HADDPS instructions, but I haven't measured too closely.
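To make the snippets above easier to try, here is a minimal sketch that wraps each one in a function and extracts lane 0 as a scalar. The function names hsum4 and hsum3_upper are hypothetical (they are not from the answer), and using _mm_cvtss_f32 to read the result is my assumption about how the sum would be consumed. Nothing here says anything about speed; it only shows which lanes end up in the result.

```c
#include <xmmintrin.h>  /* SSE: _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss */

/* Hypothetical wrapper: sum of all four lanes, v0 + v1 + v2 + v3. */
static float hsum4(__m128 v)
{
    /* movehl copies the high pair down: [v2, v3, v2, v3].
       After the add, lane 0 = v0+v2 and lane 1 = v1+v3. */
    const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
    /* Shuffle immediate 1 puts t[1] into lane 0; add_ss adds lane 0 only. */
    const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
    return _mm_cvtss_f32(sum);  /* read lane 0 as a scalar float */
}

/* Hypothetical wrapper: sum of the upper three lanes, v1 + v2 + v3.
   Lane 0 of the input does not contribute to the result. */
static float hsum3_upper(__m128 v)
{
    const __m128 t1 = _mm_movehl_ps(v, v);  /* [v2, v3, v2, v3] */
    const __m128 t2 = _mm_add_ps(v, t1);    /* lane 1 = v1+v3 */
    /* lane 0 = t1[0] + t2[1] = v2 + (v1+v3) */
    const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));
    return _mm_cvtss_f32(sum);
}
```

For example, with v = _mm_setr_ps(1.0f, 2.0f, 4.0f, 8.0f), hsum4(v) gives 15 and hsum3_upper(v) gives 14, which confirms that the three-element version skips lane 0. If the three floats you care about live in lanes 0..2 instead, the shuffles would need adjusting.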
