Most efficient way to get a m256 of horizontal sums of 8 source m256 vectors [duplicate]

后端未结

关注

 1  690

爱一瞬间的悲伤 2020-12-21 19:39

1条回答

囚心锁ツ (楼主)

2020-12-21 20:19
Update: Computing 8 horizontal sums of eight AVX single-precision floating-point vectors is (I think) the same problem, solved with one a blend replacing one of the _mm256_permute2f128_ps. And another answer with more blends replacing shuffle uops. Use one of those instead.

Original answer that fails to use any blends and will bottleneck on shuffles

You can use 2x _mm256_permute2f128_ps to line up the low and high lanes for a vertical vaddps. This is instead of 2x extractf128 / insertf128. This also turns two 128b vaddps xmm instructions into a single 256b vaddps ymm.

vperm2f128 is as fast as a single vextractf128 or vinsertf128 on Intel CPUs. It's slow on AMD, though (8 m-ops with 4c latency on Bulldozer-family). Still, not so bad that you need to avoid it, even if you care about perf on AMD. (And one of the permutes can actually be a vinsertf128).
```
__m256 hsum8(__m256 a, __m256 b, __m256 c, __m256 d,
             __m256 e, __m256 f, __m256 g, __m256 h)
{
    // a = [ A7 A6 A5 A4 | A3 A2 A1 A0 ]
    __m256 sumab = _mm256_hadd_ps(a, b);
    __m256 sumcd = _mm256_hadd_ps(c, d);

    __m256 sumef = _mm256_hadd_ps(e, f);
    __m256 sumgh = _mm256_hadd_ps(g, h);

    __m256 sumabcd = _mm256_hadd_ps(sumab, sumcd);  // [ D7:4 ... A7:4 | D3:0 ... A3:0 ]
    __m256 sumefgh = _mm256_hadd_ps(sumef, sumgh);  // [ H7:4 ... E7:4 | H3:0 ... E3:0 ]

    __m256 sum_hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31);  // [ H7:4 ... E7:4 | D7:4 ... A7:4 ]
    __m256 sum_lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20);  // [ H3:0 ... E3:0 | D3:0 ... A3:0 ]

    __m256 result = _mm256_add_ps(sum_hi, sum_lo);
    return result;
}
```
This compiles as you'd expect. The second permute2f128 actually compiles to a vinsertf128, since it's only using the low lane of each input in the same way that vinsertf128 does. gcc 4.7 and later do this optimization, but only much more recent clang versions do (v3.7). If you care about old clang, do this at the source level.

The savings in source lines is bigger than the savings in instructions, because _mm256_extractf128_ps(sumabcd, 0); compiles to zero instructions: it's just a cast. No compiler should ever emit vextractf128 with an imm8 other than 1. (vmovdqa xmm/m128, xmm is always better for getting the low lane). Nice job Intel on wasting an instruction byte on future-proofing that you couldn't use because plain VEX prefixes don't have room to encode longer vectors.

The two vaddps xmm instructions could run in parallel, so using a single vaddps ymm is mostly just a throughput (and code size) gain, not latency.

We do shave off 3 cycles from completely eliminating the final vinsertf128, though.

vhaddps is 3 uops, 5c latency, and one per 2c throughput. (6c latency on Skylake). Two of those three uops run on the shuffle port. I guess it's basically doing 2x shufps to generate operands for addps.

If we can emulate haddps (or at least get a horizontal operation we can use) with a single shufps/addps or something, we'd come out ahead. Unfortunately, I don't see how. A single shuffle can only produce one result with data from two vectors, but we need both inputs to vertical addps to have data from both vectors.

I don't think doing the horizontal sum another way looks promising. Normally, hadd is not a good choice, because the common horizontal-sum use-case only cares about one element of its output. That's not the case here: every element of every hadd result is actually used.
0 讨论(0)
发布评论:

提交评论
- 加载中...

热议问题

Most efficient way to get a __m256 of horizontal sums of 8 source __m256 vectors [duplicate]

Most efficient way to get a m256 of horizontal sums of 8 source m256 vectors [duplicate]