I'm writing a program to detect prime numbers. One part is bit-sieving possible candidates out. I've written a fairly fast program, but I thought I'd see if anyone has some ideas for making it faster.
IDK why you use different parts of the same cur[8] array for indices and values; it made the source harder to understand, to figure out that there was only one real array. The other was just to bounce vectors to scalars.

It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX[]; I'm not familiar with your sieving algorithm, but I guess the point of this is to create data in memory for later use.
AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD (like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.
Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions.
See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.
One baseline is doing more of the work scalar: generating values as needed (instead of computing all of cur[4..7] at once) might help, unless those mask vectors have different values in different elements. (Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)

Extracting all 4 elements of a vector as scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 uops on Intel), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.

With AVX2, you have 3 main options:
Option 1: extract indices and values to scalar like you're doing now, so the OR into sieveX[] is pure scalar. Works even when two or more indices are the same.

This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering for the compiler, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput). If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want to use alignas(32) cur[8] to make sure actual vector stores don't cross a cache-line boundary.
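As a rough illustration of this strategy (a sketch with assumed variable names, not the OP's code: idx holds 4 qword indices into sieveX and val holds the corresponding masks from the variable-count shift), the ALU-extract version could look like this:

#include <immintrin.h>
#include <stdint.h>

static inline void or_into_sieve_scalar(uint64_t *sieveX, __m256i idx, __m256i val)
{
    // low halves are free casts; the high halves need vextracti128,
    // then vmovq / vpextrq per element
    __m128i idx_lo = _mm256_castsi256_si128(idx);
    __m128i idx_hi = _mm256_extracti128_si256(idx, 1);
    __m128i val_lo = _mm256_castsi256_si128(val);
    __m128i val_hi = _mm256_extracti128_si256(val, 1);

    // scalar ORs: correct even if two indices happen to be equal
    sieveX[(uint64_t)_mm_cvtsi128_si64(idx_lo)]    |= (uint64_t)_mm_cvtsi128_si64(val_lo);
    sieveX[(uint64_t)_mm_extract_epi64(idx_lo, 1)] |= (uint64_t)_mm_extract_epi64(val_lo, 1);
    sieveX[(uint64_t)_mm_cvtsi128_si64(idx_hi)]    |= (uint64_t)_mm_cvtsi128_si64(val_hi);
    sieveX[(uint64_t)_mm_extract_epi64(idx_hi, 1)] |= (uint64_t)_mm_extract_epi64(val_hi, 1);
}

The store/reload alternative would be a _mm256_store_si256 into an alignas(32) buffer (or the existing cur[8]) followed by scalar loads.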
A memory-destination or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64(sieveX). So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address (see footnote 1), you could skip the vpaddq and extract already-scaled indices for the same gain.)
It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port 7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-destination or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.

This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.
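For concreteness, here's a minimal sketch of that address calculation (assuming bitidx is a vector of bit indices into the sieve; these names are mine, not the OP's):

#include <immintrin.h>
#include <stdint.h>

// Convert bit indices into byte addresses of the containing qwords:
// sieveX + (bitidx >> 6) * 8  ==  sieveX + ((bitidx >> 3) & ~7)
static inline __m256i sieve_qword_pointers(const uint64_t *sieveX, __m256i bitidx)
{
    __m256i byteoff = _mm256_srli_epi64(bitidx, 3);                   // srli by 3 instead of 6
    byteoff = _mm256_and_si256(byteoff, _mm256_set1_epi64x(~7LL));    // vpand: clear the low 3 bits
    return _mm256_add_epi64(byteoff,
                            _mm256_set1_epi64x((int64_t)(intptr_t)sieveX));  // vpaddq with set1(sieveX)
}

Each extracted element is then a uint64_t* you can dereference with a simple [reg] addressing mode.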
Option 2: extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.

Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (Vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2.)

See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but it took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.
Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.

The former has less port 5 pressure, and has better instruction-level parallelism: the two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.
Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused with vmovhps; see footnote 2). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port 2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.
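A rough sketch of this manual gather / SIMD OR / manual scatter for one vector of 4 (unique) indices could look like the following; the variable names are assumptions, and the halves are kept in two separate __m128i values as suggested above:

#include <immintrin.h>
#include <stdint.h>

static inline void simd_or_scatter(uint64_t *sieveX, __m256i idx, __m256i val)
{
    // extract the 4 qword indices to scalar (they must all be distinct for correctness)
    __m128i idx_lo = _mm256_castsi256_si128(idx);
    __m128i idx_hi = _mm256_extracti128_si256(idx, 1);
    uint64_t *p0 = &sieveX[(uint64_t)_mm_cvtsi128_si64(idx_lo)];
    uint64_t *p1 = &sieveX[(uint64_t)_mm_extract_epi64(idx_lo, 1)];
    uint64_t *p2 = &sieveX[(uint64_t)_mm_cvtsi128_si64(idx_hi)];
    uint64_t *p3 = &sieveX[(uint64_t)_mm_extract_epi64(idx_hi, 1)];

    // manual gather: vmovq load + vmovhps merge into the high half of each xmm
    __m128i g01 = _mm_loadl_epi64((const __m128i *)p0);
    g01 = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(g01), (const double *)p1));
    __m128i g23 = _mm_loadl_epi64((const __m128i *)p2);
    g23 = _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(g23), (const double *)p3));

    // SIMD OR in two 128-bit halves (vextracti128 + 2x vpor xmm)
    __m128i r01 = _mm_or_si128(g01, _mm256_castsi256_si128(val));
    __m128i r23 = _mm_or_si128(g23, _mm256_extracti128_si256(val, 1));

    // manual scatter: vmovq store + vmovhps store
    _mm_storel_epi64((__m128i *)p0, r01);
    _mm_storeh_pd((double *)p1, _mm_castsi128_pd(r01));
    _mm_storel_epi64((__m128i *)p2, r23);
    _mm_storeh_pd((double *)p3, _mm_castsi128_pd(r23));
}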
Option 3: use vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level-parallelism hit and gather into a single 256-bit register.)

Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not even Broadwell (9 uops, one per 6c throughput) and definitely not Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.
It looks like this won't bottleneck badly on any one port. GP reg -> xmm needs port 5, but xmm -> int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (E.g. vpextrq rax, xmm0, 1 is a 2-uop instruction: one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)
So your SIMD calculation where you need to frequently extract a vector to scalar is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.
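For illustration, here's a sketch of option 3 (again with assumed names; the scatter is the same vmovq / vmovhps scatter as in the option 2 sketch):

#include <immintrin.h>
#include <stdint.h>

static inline void hwgather_or_scatter(uint64_t *sieveX, __m256i qword_idx, __m256i val)
{
    // indices still have to be extracted to scalar for the scatter
    __m128i idx_lo = _mm256_castsi256_si128(qword_idx);
    __m128i idx_hi = _mm256_extracti128_si256(qword_idx, 1);
    uint64_t *p0 = &sieveX[(uint64_t)_mm_cvtsi128_si64(idx_lo)];
    uint64_t *p1 = &sieveX[(uint64_t)_mm_extract_epi64(idx_lo, 1)];
    uint64_t *p2 = &sieveX[(uint64_t)_mm_cvtsi128_si64(idx_hi)];
    uint64_t *p3 = &sieveX[(uint64_t)_mm_extract_epi64(idx_hi, 1)];

    // vpgatherqq ymm (fast on Skylake, slow on Haswell / Broadwell / AMD), then vpor ymm
    __m256i old = _mm256_i64gather_epi64((const long long *)sieveX, qword_idx, 8);
    __m256i result = _mm256_or_si256(old, val);

    // split and scatter with vmovq / vmovhps stores (indices must be unique)
    __m128i r01 = _mm256_castsi256_si128(result);
    __m128i r23 = _mm256_extracti128_si256(result, 1);
    _mm_storel_epi64((__m128i *)p0, r01);
    _mm_storeh_pd((double *)p1, _mm_castsi128_pd(r01));
    _mm_storel_epi64((__m128i *)p2, r23);
    _mm_storeh_pd((double *)p3, _mm_castsi128_pd(r23));
}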
extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.
vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake)
2x vextracti128 (2 uops for port 5)
4x vmovq (4 p0)
4x vpextrq (8: 4p0 4p5)
4x or [r], r (2 front-end uops each = 8. Backend: 4p0156 + 4p23 (load) + 4p237 (store-address) + 4p4 (store-data)). Non-indexed addressing mode.
Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)
manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.
(optional, probably not worth it): vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake). Skip these if you could use non-VEX movhps for a 1-uop micro-fused indexed load. (But then p237 stores become p23.)
vextracti128 pointers (1 uop for port 5)
2x vmovq extract (2p0)
2x vpextrq (4 = 2p0 2p5)
2x vmovq load (2p23)
2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)
vextracti128 split the data (p5)
2x vpor xmm (2p015)
2x vmovq store (2x 1 micro-fused uop, 2p237 + 2p4)
2x vmovhps store (2x 1 micro-fused uop, 2p237 + 2p4)
Port bottlenecks: 4 p0 and 4 p5 fit comfortably in 5 cycles, especially when you mix this with your loop, which can run several of its uops on port 1. On Haswell, paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.
Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instructions / smaller code-size and fewer uops in the uop cache. (Un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)
Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).
vextracti128 indices (1 uop for port 5)
2x vmovq extract (2p0)
2x vpextrq (4 = 2p0 2p5)
vpcmpeqd ymm0,ymm0,ymm0 to create an all-ones mask for vpgatherqq (p015)
vpgatherqq ymm1, [rdi + ymm2*8], ymm0: 4 uops for some ports
vpor ymm (p015)
vextracti128 on the OR result (p5)
2x vmovq store (2x 1 micro-fused uop, 2p23 + 2p4). Note no port7, we're using indexed stores.
2x vmovhps store (2x 1 micro-fused uop, 2p23 + 2p4).
So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.
We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 7-uop, 5-instruction vextracti128 / 2x vmovq / 2x vpextrq sequence with 5 uops: 1 store + 4x reload.
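A minimal sketch of that store/reload extract (names assumed), trading some store-forwarding latency for fewer uops and no port-5 shuffles:

#include <immintrin.h>
#include <stdint.h>

static inline void extract_via_store_reload(__m256i idx, uint64_t out[4])
{
    alignas(32) uint64_t tmp[4];                 // aligned so the store can't split a cache line
    _mm256_store_si256((__m256i *)tmp, idx);     // 1 store uop
    out[0] = tmp[0];                             // 4 scalar reloads
    out[1] = tmp[1];
    out[2] = tmp[2];
    out[3] = tmp[3];
}

Whether the compiler actually keeps this as a store + 4 reloads (rather than turning it back into shuffles/extracts) is up to it, as noted above.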
You say that after a certain point, you don't have conflicts (overlap) between the indices, like cur[0] == cur[2].
You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX512, Skylake's vpconflictq is micro-coded and not fast. (KNL has single-uop vpconflictq, but it's still faster to avoid it entirely.)
I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.
You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.
There's not a huge speedup from the best scalar-OR version to the best gather version (6 cycles vs. 4.5 isn't accounting for the other work in the loop, so the ratio is even smaller than that). Leaving the slightly-slower version as soon as possible isn't worth making it a lot slower with frequent conflict checks.
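As a sketch of what the per-check work could look like for one vector of 4 qword indices (my assumption about the check, not the OP's code; a full check would also need to compare against the other vector of indices and account for future iterations), two qword rotations cover all 6 pairs:

#include <immintrin.h>

static inline bool has_conflict(__m256i idx)
{
    __m256i rot1 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(0, 3, 2, 1)); // pairs (0,1)(1,2)(2,3)(3,0)
    __m256i rot2 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(1, 0, 3, 2)); // pairs (0,2)(1,3)
    __m256i eq = _mm256_or_si256(_mm256_cmpeq_epi64(idx, rot1),
                                 _mm256_cmpeq_epi64(idx, rot2));
    return !_mm256_testz_si256(eq, eq);    // any equal pair => conflict
}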
So if you can reliably detect when you're done with conflicts, use something like
int conflictcheck = 10;
do {
    if (--conflictcheck == 0) {
        // vector stuff to check for conflicts
        if (no conflicts now or in the future)
            break;
        conflictcheck = 10;      // reset the down-counter
    }
    // main loop body, extract -> scalar OR strategy
} while(blah);

// then fall into the gather/scatter loop.
do {
    // main loop body, gather + manual scatter strategy
} while();
That should compile to a dec / je which only costs 1 uop in the not-taken case.
Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.
Footnote 1:
If sieveX is static and you're building non-PIC code on Linux (not MacOS), then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.

This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode, so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.
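A tiny sketch of that cast-and-add (assuming scaled_off already has its low 3 bits cleared, i.e. it's a byte offset rather than a qword index):

#include <stdint.h>

static inline uint64_t *sieve_qword(uint64_t *sieveX, uint64_t scaled_off)
{
    return (uint64_t *)((uintptr_t)sieveX + scaled_off);   // add the byte offset, cast back to a pointer
}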
Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.
IACA (Intel's static analysis tool) gets this wrong, though. :( See What is IACA and how do I use it?
This is basically a missed optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?
The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps
is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf
to count uops, with this loop:
    mov     r15d, 100000000
.loop:
    vpaddq  ymm0, ymm1, ymm2        ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2        ; dirty another register for good measure
    vmovq   xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps  xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                                    ; fast on Skylake, will stall on Haswell
    dec     r15d
    jnz     .loop
The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock: 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.
Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration: now 6 fused-domain uops, but still the same 6 unfused-domain uops.

There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested that by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0, so it's not quite movlps).
Also note that pinsrq xmm0, [mem], 1 can't micro-fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.
Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though; I didn't check.
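If it does, one way to ask for the movhps form on integer data is to go through the pd load-high intrinsic (a sketch; whether the compiler actually emits vmovhps rather than vpinsrq is still its choice):

#include <immintrin.h>
#include <stdint.h>

static inline __m128i load_high_qword(__m128i v, const uint64_t *p)
{
    // merge 64 bits from memory into the high half of v (movhps / vmovhps)
    return _mm_castpd_si128(_mm_loadh_pd(_mm_castsi128_pd(v), (const double *)p));
}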