Update: Computing 8 horizontal sums of eight AVX single-precision floating-point vectors is (I think) the same problem, solved with one a blend replacing one of the _mm256_permute2f128_ps. And another answer with more blends replacing shuffle uops. Use one of those instead.
Original answer that fails to use any blends and will bottleneck on shuffles
You can use 2x _mm256_permute2f128_ps
to line up the low and high lanes for a vertical vaddps
. This is instead of 2x extractf128
/ insertf128
. This also turns two 128b vaddps xmm
instructions into a single 256b vaddps ymm
.
vperm2f128
is as fast as a single vextractf128
or vinsertf128
on Intel CPUs. It's slow on AMD, though (8 m-ops with 4c latency on Bulldozer-family). Still, not so bad that you need to avoid it, even if you care about perf on AMD. (And one of the permutes can actually be a vinsertf128
).
__m256 hsum8(__m256 a, __m256 b, __m256 c, __m256 d,
__m256 e, __m256 f, __m256 g, __m256 h)
{
// a = [ A7 A6 A5 A4 | A3 A2 A1 A0 ]
__m256 sumab = _mm256_hadd_ps(a, b);
__m256 sumcd = _mm256_hadd_ps(c, d);
__m256 sumef = _mm256_hadd_ps(e, f);
__m256 sumgh = _mm256_hadd_ps(g, h);
__m256 sumabcd = _mm256_hadd_ps(sumab, sumcd); // [ D7:4 ... A7:4 | D3:0 ... A3:0 ]
__m256 sumefgh = _mm256_hadd_ps(sumef, sumgh); // [ H7:4 ... E7:4 | H3:0 ... E3:0 ]
__m256 sum_hi = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x31); // [ H7:4 ... E7:4 | D7:4 ... A7:4 ]
__m256 sum_lo = _mm256_permute2f128_ps(sumabcd, sumefgh, 0x20); // [ H3:0 ... E3:0 | D3:0 ... A3:0 ]
__m256 result = _mm256_add_ps(sum_hi, sum_lo);
return result;
}
This compiles as you'd expect. The second permute2f128
actually compiles to a vinsertf128
, since it's only using the low lane of each input in the same way that vinsertf128
does. gcc 4.7 and later do this optimization, but only much more recent clang versions do (v3.7). If you care about old clang, do this at the source level.
The savings in source lines is bigger than the savings in instructions, because _mm256_extractf128_ps(sumabcd, 0);
compiles to zero instructions: it's just a cast. No compiler should ever emit vextractf128
with an imm8 other than 1
. (vmovdqa xmm/m128, xmm
is always better for getting the low lane). Nice job Intel on wasting an instruction byte on future-proofing that you couldn't use because plain VEX prefixes don't have room to encode longer vectors.
The two vaddps xmm
instructions could run in parallel, so using a single vaddps ymm
is mostly just a throughput (and code size) gain, not latency.
We do shave off 3 cycles from completely eliminating the final vinsertf128
, though.
vhaddps
is 3 uops, 5c latency, and one per 2c throughput. (6c latency on Skylake). Two of those three uops run on the shuffle port. I guess it's basically doing 2x shufps
to generate operands for addps
.
If we can emulate haddps
(or at least get a horizontal operation we can use) with a single shufps
/addps
or something, we'd come out ahead. Unfortunately, I don't see how. A single shuffle can only produce one result with data from two vectors, but we need both inputs to vertical addps
to have data from both vectors.
I don't think doing the horizontal sum another way looks promising. Normally, hadd is not a good choice, because the common horizontal-sum use-case only cares about one element of its output. That's not the case here: every element of every hadd
result is actually used.