Fastest way to do horizontal SSE vector sum (or other reduction)

傲寒 2020-11-21 07:21

Given a vector of three (or four) floats, what is the fastest way to sum them?

Is SSE (movaps, shuffle, add, movd) always faster than x87? Are the horizontal-add instructions…

4 answers
  •  遇见更好的自我
    2020-11-21 07:58

    SSE2 (the intrinsics below only need SSE1, so every SSE2-capable CPU has them)

    Sum of all four elements:

    // t = [v0+v2, v1+v3, v2+v2, v3+v3]
    const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
    // lane 0 = (v0+v2) + (v1+v3); read it with _mm_cvtss_f32(sum)
    const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
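A minimal self-contained sketch of the four-element sum above, wrapped in a C function. The name `hsum4_ps` is hypothetical (not from the answer), and an x86/x86-64 target is assumed. The result lands in lane 0, which `_mm_cvtss_f32` extracts as a plain `float`.

```c
#include <xmmintrin.h> /* SSE: _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps, _mm_add_ss, _mm_cvtss_f32 */

/* Hypothetical helper: returns v0 + v1 + v2 + v3. */
static float hsum4_ps(__m128 v)
{
    /* movehl(v, v) copies the high pair down: [v2, v3, v2, v3],
       so t = [v0+v2, v1+v3, v2+v2, v3+v3]. */
    const __m128 t = _mm_add_ps(v, _mm_movehl_ps(v, v));
    /* Shuffle immediate 1 moves t[1] into lane 0; add_ss adds lane 0 only. */
    const __m128 sum = _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
    return _mm_cvtss_f32(sum); /* lane 0 = (v0+v2) + (v1+v3) */
}
```

For example, `hsum4_ps(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f))` returns `10.0f` (`_mm_setr_ps` puts its first argument in lane 0).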
    

    Sum of the upper three elements only, r1+r2+r3 (lane r0 is ignored):

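    const __m128 t1 = _mm_movehl_ps(v, v);   // [v2, v3, v2, v3]
    const __m128 t2 = _mm_add_ps(v, t1);     // lane 1 = v1+v3
    // lane 0 = v2 + (v1+v3); lane r0 of v never reaches the result
    const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));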
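Because the three-element sequence above reads lanes 1..3, it does not fit the common layout where x, y, z sit in lanes 0..2. The sketch below wraps the answer's sequence (hypothetical name `hsum_r123_ps`) and adds a variant for lanes 0..2 (`hsum_r012_ps`), which is my own illustration built from the same intrinsics, not part of the answer.

```c
#include <xmmintrin.h> /* SSE: _mm_movehl_ps, _mm_add_ps, _mm_add_ss, _mm_shuffle_ps, _mm_cvtss_f32 */

/* Hypothetical wrapper around the answer's r1+r2+r3 sequence:
   returns v1 + v2 + v3; lane 0 is ignored. */
static float hsum_r123_ps(__m128 v)
{
    const __m128 t1 = _mm_movehl_ps(v, v);                          /* [v2, v3, v2, v3] */
    const __m128 t2 = _mm_add_ps(v, t1);                            /* lane 1 = v1 + v3 */
    const __m128 sum = _mm_add_ss(t1, _mm_shuffle_ps(t2, t2, 1));   /* lane 0 = v2 + (v1 + v3) */
    return _mm_cvtss_f32(sum);
}

/* Hypothetical variant (not from the answer) for x, y, z in lanes 0..2:
   returns v0 + v1 + v2; lane 3 is ignored. */
static float hsum_r012_ps(__m128 v)
{
    const __m128 hi = _mm_movehl_ps(v, v);                    /* lane 0 = v2 */
    const __m128 xy = _mm_add_ss(v, _mm_shuffle_ps(v, v, 1)); /* lane 0 = v0 + v1 */
    return _mm_cvtss_f32(_mm_add_ss(xy, hi));                 /* lane 0 = (v0 + v1) + v2 */
}
```

The two functions add in different orders, so for values that are not exactly representable their results may differ in the last bit from a straightforward scalar `x + y + z`.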
    

    I've found these to be about the same speed as two HADDPS instructions in a row (but I haven't measured very closely).
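For comparison, a sketch of the "double HADDPS" sum the answer measures against, using the SSE3 intrinsic `_mm_hadd_ps`. The helper name `hsum4_hadd_ps` is hypothetical, and `__attribute__((target("sse3")))` is a GCC/Clang extension used here only so the file compiles without `-msse3`; the CPU must still support SSE3 at run time.

```c
#include <pmmintrin.h> /* SSE3: _mm_hadd_ps (also pulls in the SSE/SSE2 headers) */

/* Hypothetical helper: v0 + v1 + v2 + v3 via two HADDPS. */
__attribute__((target("sse3")))
static float hsum4_hadd_ps(__m128 v)
{
    __m128 s = _mm_hadd_ps(v, v); /* [v0+v1, v2+v3, v0+v1, v2+v3] */
    s = _mm_hadd_ps(s, s);        /* every lane = (v0+v1) + (v2+v3) */
    return _mm_cvtss_f32(s);      /* extract lane 0 */
}
```

A commonly cited reason the shuffle-and-add sequences keep up is that HADDPS itself decodes into several micro-ops (shuffles plus an add) on many Intel and AMD cores; instruction tables for the specific CPU are the place to confirm this.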
