How to sum __m256 horizontally?

迷失自我 2020-11-30 07:41

I would like to horizontally sum the components of a __m256 vector using AVX instructions. In SSE I could use

_mm_hadd_ps(xmm,xmm);
_mm_hadd_ps(         


        
2 Answers
  • 2020-11-30 08:25

    This can be done with the following code:

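    // ymm2 = ymm with its two 128-bit halves swapped
    __m256 ymm2 = _mm256_permute2f128_ps(ymm, ymm, 1);
    // each 128-bit lane now holds ( x3 + x7, x2 + x6, x1 + x5, x0 + x4 )
    ymm = _mm256_add_ps(ymm, ymm2);
    // two in-lane horizontal adds leave the full sum in every element
    ymm = _mm256_hadd_ps(ymm, ymm);
    ymm = _mm256_hadd_ps(ymm, ymm);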
    

    There might be a better solution, though.
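
    To get a scalar out of that sequence, the low element can be extracted after the final hadd, since every element then holds the full sum. A minimal sketch, assuming GCC or Clang; the wrapper name hsum8_hadd and its pointer-based interface are illustrative choices, not part of the answer:

    ```c
    #include <immintrin.h>

    // Hypothetical wrapper: returns the sum of the 8 floats starting at p.
    // The target attribute is a GCC/Clang extension that enables AVX for this
    // one function; when compiling with -mavx it can be dropped.
    __attribute__((target("avx")))
    float hsum8_hadd(const float *p) {
        __m256 v = _mm256_loadu_ps(p);                     // unaligned load of 8 floats
        __m256 swapped = _mm256_permute2f128_ps(v, v, 1);  // swap the 128-bit halves
        v = _mm256_add_ps(v, swapped);                     // lane-wise low + high
        v = _mm256_hadd_ps(v, v);                          // pairwise sums within each lane
        v = _mm256_hadd_ps(v, v);                          // full sum in every element
        return _mm_cvtss_f32(_mm256_castps256_ps128(v));   // read element 0
    }
    ```

    Taking a pointer rather than a __m256 argument keeps the function safe to call from code compiled without AVX enabled.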

  • 2020-11-30 08:46

    This version should be optimal on Intel Sandy Bridge/Ivy Bridge and AMD Bulldozer, as well as on later CPUs.

    // x = ( x7, x6, x5, x4, x3, x2, x1, x0 )
    float sum8(__m256 x) {
        // hiQuad = ( x7, x6, x5, x4 )
        const __m128 hiQuad = _mm256_extractf128_ps(x, 1);
        // loQuad = ( x3, x2, x1, x0 )
        const __m128 loQuad = _mm256_castps256_ps128(x);
        // sumQuad = ( x3 + x7, x2 + x6, x1 + x5, x0 + x4 )
        const __m128 sumQuad = _mm_add_ps(loQuad, hiQuad);
        // loDual = ( -, -, x1 + x5, x0 + x4 )
        const __m128 loDual = sumQuad;
        // hiDual = ( -, -, x3 + x7, x2 + x6 )
        const __m128 hiDual = _mm_movehl_ps(sumQuad, sumQuad);
        // sumDual = ( -, -, x1 + x3 + x5 + x7, x0 + x2 + x4 + x6 )
        const __m128 sumDual = _mm_add_ps(loDual, hiDual);
        // lo = ( -, -, -, x0 + x2 + x4 + x6 )
        const __m128 lo = sumDual;
        // hi = ( -, -, -, x1 + x3 + x5 + x7 )
        const __m128 hi = _mm_shuffle_ps(sumDual, sumDual, 0x1);
        // sum = ( -, -, -, x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 )
        const __m128 sum = _mm_add_ss(lo, hi);
        return _mm_cvtss_f32(sum);
    }
    

    haddps is not efficient on any CPU. The best you can do is one shuffle (to extract the high half) and one add, repeated until a single element is left. Narrowing to 128 bits as the first step helps AMD CPUs before Zen 2, which execute 256-bit operations as two 128-bit halves, and does no harm anywhere else.

    See "Fastest way to do horizontal SSE vector sum on x86" for more details about efficiency.
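
    Because the horizontal step is the expensive part, a typical use is to accumulate vertically in a loop and call sum8 only once at the end. The sketch below assumes GCC or Clang; sum_array is a hypothetical helper, and sum8 is a condensed copy of the function above:

    ```c
    #include <immintrin.h>
    #include <stddef.h>

    // Condensed copy of sum8 from the answer. The target attribute is a
    // GCC/Clang extension; with -mavx on the command line it is unnecessary.
    __attribute__((target("avx")))
    static float sum8(__m256 x) {
        // ( x3 + x7, x2 + x6, x1 + x5, x0 + x4 )
        const __m128 sumQuad = _mm_add_ps(_mm256_castps256_ps128(x),
                                          _mm256_extractf128_ps(x, 1));
        // low two elements: ( x1 + x3 + x5 + x7, x0 + x2 + x4 + x6 )
        const __m128 sumDual = _mm_add_ps(sumQuad, _mm_movehl_ps(sumQuad, sumQuad));
        // element 0: the sum of all eight inputs
        const __m128 sum = _mm_add_ss(sumDual, _mm_shuffle_ps(sumDual, sumDual, 0x1));
        return _mm_cvtss_f32(sum);
    }

    // Hypothetical helper (not from the answer): sums n floats.
    __attribute__((target("avx")))
    float sum_array(const float *data, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8)            // vertical adds, 8 floats per step
            acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
        float total = sum8(acc);              // single horizontal reduction
        for (; i < n; ++i)                    // scalar tail for the remainder
            total += data[i];
        return total;
    }
    ```

    Note that this adds the elements in a different order than a plain scalar loop, so results can differ slightly in the last bits for inputs that are not exactly representable.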
