I'm writing a program to detect prime numbers. One part is bit sieving out possible candidates. I've written a fairly fast program, but I thought I'd see if anyone has some
I just looked at exactly what you're doing here: for the mod1 = mod3 = _mm256_set1_epi64x(1); case, you're just setting single bits in a bitmap with elements of ans as the index. And it's unrolled by two, with ans and ans2 running in parallel, using mod1 << ans and mod3 << ans2.

Comment your code and explain what's going on in the big picture using English text! This is just a very complicated implementation of the bit-setting loop of a normal Sieve of Eratosthenes. (So it would have been nice if the question had said so in the first place.)
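For reference, here's the bit-setting loop that this all boils down to, as plain scalar C++ (just a sketch; the sieveX / totalX names follow the code further down in this answer):

#include <stdint.h>

// Plain-scalar equivalent of the big-picture operation: for one factor,
// set every factor'th bit of a qword bitmap, starting at cur.
static void setBitsScalar(uint64_t *sieveX, uint64_t cur, uint64_t factor, uint64_t totalX)
{
    for (; cur < totalX; cur += factor)
        sieveX[cur >> 6] |= 1ULL << (cur & 0x3f);
}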
Unrolling with multiple starts/strides in parallel is a very good optimization, so you normally set multiple bits in a cache line while it's still hot in L1d. Cache-blocking for fewer different factors at once has similar advantages: iterate over the same 8kiB or 16kiB chunk of memory repeatedly for multiple factors (strides) before moving on to the next. Unrolling with 4 offsets for each of 2 different strides could be a good way to create more ILP.

The more strides you run in parallel, the slower you go through new cache lines the first time you touch them, though (which does give cache / TLB prefetch room to avoid even an initial stall). So cache blocking doesn't remove all the benefit of multiple strides.
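A minimal sketch of what cache-blocking could look like (the 16kiB chunk size and the cur / stride bookkeeping here are my assumptions, not taken from your code):

#include <stdint.h>
#include <stddef.h>

// Hypothetical cache-blocked sieving: sweep one L1d-sized chunk of the bitmap
// with every stride before moving on, so each cache line gets all its bits set
// while it's still hot.
static void sieveBlocked(uint64_t *sieveX, uint64_t totalX,
                         uint64_t *cur, const uint64_t *stride, size_t nfactors)
{
    const uint64_t chunkBits = 16 * 1024 * 8;          // 16kiB of bitmap per block
    for (uint64_t base = 0; base < totalX; base += chunkBits) {
        uint64_t end = (base + chunkBits < totalX) ? base + chunkBits : totalX;
        for (size_t f = 0; f < nfactors; f++) {
            uint64_t c = cur[f];
            for (; c < end; c += stride[f])
                sieveX[c >> 6] |= 1ULL << (c & 0x3f);
            cur[f] = c;                                // resume point for the next chunk
        }
    }
}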
A single 256-bit vector load/VPOR/store can set multiple bits. The trick is creating a vector constant, or set of vector constants, with bits in the right position. The repeating pattern is something like LCM(256, bit_stride) bits long, though. For example, stride=3 would repeat in a pattern that's 3 vectors long. This very quickly gets unusable for odd / prime strides unless there's something more clever :(
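For a small stride it could look something like this (sketch only: it hard-codes stride = 3, assumes the region length is a multiple of the 96-byte pattern, and starts the pattern at bit 0):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Hypothetical small-stride case: stride=3 repeats every LCM(256,3) = 768 bits,
// i.e. every 3 vectors, so 3 precomputed masks cover the whole pattern.
static void sieveStride3(uint8_t *sieve, size_t bytes)   // bytes % 96 == 0 assumed
{
    alignas(32) uint8_t pattern[96] = {};                // 768 bits
    for (unsigned bit = 0; bit < 768; bit += 3)
        pattern[bit >> 3] |= 1u << (bit & 7);

    __m256i m0 = _mm256_load_si256((const __m256i*)(pattern + 0));
    __m256i m1 = _mm256_load_si256((const __m256i*)(pattern + 32));
    __m256i m2 = _mm256_load_si256((const __m256i*)(pattern + 64));

    for (size_t i = 0; i < bytes; i += 96) {
        __m256i *p = (__m256i*)(sieve + i);
        _mm256_storeu_si256(p + 0, _mm256_or_si256(_mm256_loadu_si256(p + 0), m0));
        _mm256_storeu_si256(p + 1, _mm256_or_si256(_mm256_loadu_si256(p + 1), m1));
        _mm256_storeu_si256(p + 2, _mm256_or_si256(_mm256_loadu_si256(p + 2), m2));
    }
}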
64-bit scalar is interesting because bitwise rotate is available to create a sequence of patterns, but variable-count rotate on SnB-family CPUs costs 2 uops.
There might be more you can do with this; maybe unaligned loads could help somehow.
A repeating pattern of bitmasks could be useful even for the large-stride case, e.g. rotating by stride % 8 every time. But that would be more useful if you were JITing a loop that hard-coded the pattern into or byte [mem], imm8, with an unroll factor chosen to be congruent with the repeat length.
You don't have to load/modify/store 64-bit chunks when you're only setting a single bit. The narrower your RMW operations, the closer your bit-indices can be without conflicting.
(But you don't have a long loop-carried dep chain on the same location; you will move on before OoO exec stalls waiting for reloads at the end of a long chain. So if conflicts aren't a correctness problem, it's unlikely to make a big perf difference here. Unlike a bitmap histogram or something where a long chain of repeated hits on nearby bits could easily happen.)
32-bit elements would be an obvious choice. x86 can efficiently load/store dwords to/from SIMD registers as well as scalar. (Scalar byte ops are efficient, too, but byte stores from SIMD regs always require multiple uops with pextrb.)
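For reference, the same single-bit set at the three RMW widths (nothing clever, just the shift-and-mask spelled out):

#include <stdint.h>

// Narrower RMW widths mean nearby bit-indices touch different memory
// locations, so they can't conflict with each other.
static inline void setQword(uint64_t *s, uint64_t i) { s[i >> 6] |= 1ULL << (i & 63); }
static inline void setDword(uint32_t *s, uint64_t i) { s[i >> 5] |= 1U << (i & 31); }
static inline void setByte (uint8_t  *s, uint64_t i) { s[i >> 3] |= (uint8_t)(1u << (i & 7)); }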
If you're not gathering into SIMD registers, the SIMD element width for ans / ans2 doesn't have to match the RMW width. 32-bit RMW has advantages over 8-bit if you want to split a bit-index into address / bit-offset in scalar, using shifts or bts that implicitly mask the shift count to 32 bits (or 64 bits for 64-bit shifts). But 8-bit shlx or bts doesn't exist.
The main advantage of using 64-bit SIMD elements is if you're calculating a pointer instead of just an index. If you could restrict your sieveX to 32 bits you'd still be able to do this, e.g. by allocating with mmap(..., MAP_32BIT|MAP_ANONYMOUS, ...) on Linux. That's assuming you don't need more than 2^32 bits (512MiB) of sieve space, so your bit indices always fit in 32-bit elements. If that's not the case, you could still use 32-bit element vectors up to that point, then use your current loop for the high part.
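A sketch of that allocation (Linux x86-64 only; the 5000000-bit size just mirrors the totalX constant in the code below):

#include <sys/mman.h>
#include <stdint.h>
#include <stdio.h>

int main()
{
    const uint64_t totalX = 5000000;                   // bits, as in the code below
    size_t bytes = (totalX + 7) / 8;
    void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);   // low 2GiB
    if (mem == MAP_FAILED) { perror("mmap"); return 1; }
    uint32_t *sieveX = (uint32_t*)mem;                 // (uintptr_t)sieveX fits in 32 bits
    (void)sieveX;                                      // ... sieve into it ...
    munmap(mem, bytes);
    return 0;
}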
If you use 32-bit SIMD elements without restricting sieveX to be a 32-bit pointer, you'd have to give up on using SIMD pointer calculations and just extract a bit-index, or still split in SIMD into idx/bit and extract both.
(With 32-bit elements, a SIMD -> scalar strategy based on store/reload looks even more attractive, but in C that's mostly up to your compiler.)
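A store/reload version might look like this (assuming ans has been narrowed to eight 32-bit bit-indices per vector, which is not what your current 64-bit-element code does):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical helper: bounce a vector of eight 32-bit bit-indices through
// memory, then do the bitmap RMWs in scalar.
static inline void setBitsStoreReload(uint32_t *sieveX, __m256i ans)
{
    alignas(32) uint32_t idx[8];
    _mm256_store_si256((__m256i*)idx, ans);            // one aligned store
    for (int i = 0; i < 8; i++)                        // eight scalar dword RMWs
        sieveX[idx[i] >> 5] |= 1U << (idx[i] & 0x1f);
}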
If you were manually gathering into 32-bit elements, you couldn't use movhps anymore. You'd have to use pinsrd / pextrd for the high 3 elements, and those never micro-fuse / always need a port5 uop on SnB-family. (Unlike movhps, which is a pure store.) But that means vpinsrd is still 2 uops with an indexed addressing mode. You could still use vmovhps for element 2 (then overwrite the top dword of the vector with vpinsrd); unaligned loads are cheap and it's ok to overlap the next element. But you can't do movhps stores, and that's where it was really good.
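For completeness, the straightforward all-pinsrd version of a 4-element dword gather would be something like this (a sketch, not the vmovhps-plus-overlap trick; each pinsrd after element 0 needs a port-5 uop on SnB-family):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical 32-bit manual gather: movd for element 0, then pinsrd (SSE4.1)
// for elements 1..3.
static inline __m128i gather4_epi32(const uint32_t *p0, const uint32_t *p1,
                                    const uint32_t *p2, const uint32_t *p3)
{
    __m128i v = _mm_cvtsi32_si128((int)*p0);
    v = _mm_insert_epi32(v, (int)*p1, 1);
    v = _mm_insert_epi32(v, (int)*p2, 2);
    v = _mm_insert_epi32(v, (int)*p3, 3);
    return v;
}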
There are two big performance problems with your current strategy:
Apparently you're sometimes using this with some elements of mod1 or mod3 being 0, resulting in completely useless wasted work, doing [mem] |= 0 for those strides.
I think once an element in ans or ans2 reaches total, you're going to fall out of the inner loop and do ans -= sum every time through the inner loop. You don't necessarily want to reset it back to ans = sum (for that element) to redo the ORing (setting bits that were already set), because that memory will be cold in cache. What we really want is to pack the remaining still-in-use elements into known locations and enter other versions of the loop that only do 7, then 6, then 5 total elements. Then we're down to only 1 vector.
That seems really clunky. A better strategy for one element hitting the end might be to finish the remaining three in that vector with scalar, one at a time, then run the remaining single __m256i vector. If the strides are all nearby, you probably get good cache locality.
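One way that fallback could be structured (a sketch; a strideVec vector holding the per-element strides is my assumption, not something from your code):

#include <immintrin.h>
#include <stdint.h>

// Hypothetical cleanup path: once any element of this vector is done, finish all
// of its elements one at a time in scalar (already-finished ones fall through),
// then keep vectorizing with the other __m256i.
static void finishVectorScalar(uint64_t *sieveX, __m256i ans, __m256i strideVec,
                               uint64_t totalX)
{
    alignas(32) uint64_t cur[4], stride[4];
    _mm256_store_si256((__m256i*)cur, ans);
    _mm256_store_si256((__m256i*)stride, strideVec);
    for (int lane = 0; lane < 4; lane++)
        for (uint64_t c = cur[lane]; c < totalX; c += stride[lane])
            sieveX[c >> 6] |= 1ULL << (c & 0x3f);
}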
Splitting the bit-index into a qword index and a bitmask with SIMD and then extracting both separately costs a lot of uops for the scalar-OR case: so many that you're not bottlenecking on 1-per-clock store throughput, even with all the optimizations in my scatter/gather answer. (Cache misses may slow this down sometimes, but fewer front-end uops means a larger out-of-order window to find parallelism and keep more memory ops in flight.)
If we can get the compiler to make good scalar code to split a bit-index, we could consider pure scalar. Or at least extracting only bit-indices and skipping the SIMD shift/mask stuff.
It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! Like 10 uops on Skylake.)
Pure scalar may not be ideal, though. I was playing around with this on Godbolt:
#include <stdint.h>
#include <immintrin.h>   // for the AVX2 version
#include <utility>       // std::swap

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX64, unsigned cur1, unsigned cur2, unsigned factor1, unsigned factor2)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    //...
#else
    //uint64_t cur = 58;
    //uint64_t cur2 = 142;
    //uint64_t factor = 67;
    uint32_t *sieveX = (uint32_t*)sieveX64;

    if (cur1 > cur2) {
        // TODO: if factors can be different, properly check which will end first
        std::swap(cur1, cur2);
        std::swap(factor1, factor2);
    }
    // factor1 = factor2;   // is this always true?

    while (cur2 < totalX) {
        sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
        sieveX[cur2 >> 5] |= (1U << (cur2 & 0x1f));
        cur1 += factor1;
        cur2 += factor2;
    }
    while (cur1 < totalX) {
        sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
        cur1 += factor1;
    }
#endif
}
Note how I replaced your outer if() that chose between loops with sorting cur1 and cur2 instead.
GCC and clang put a 1 in a register outside the loop, and use shlx r9d, ecx, esi inside the loop to do 1U << (cur1 & 0x1f) in a single uop without destroying the 1. (MSVC uses load / BTS / store, but it's clunky with a lot of mov instructions. I don't know how to tell MSVC it's allowed to use BMI2.)
If an indexed addressing mode for or [mem], reg didn't cost an extra uop, this would be great.
The problem is that you need a shr reg, 5 in there somewhere, and that's destructive. Putting 5 in a register and using that to copy+shift the bit-index would be an ideal setup for load / BTS / store, but compilers don't seem to know that optimization.
Optimal(?) scalar split and use of a bit-index
    mov   ecx, 5                ; outside the loop
.loop:
    ; ESI is the bit-index.
    ; Could be pure scalar, or could come from an extract of ans directly.
    shrx  edx, esi, ecx         ; EDX = ESI>>5 = dword index
    mov   eax, [rdi + rdx*4]    ; load the dword
    bts   eax, esi              ; set bit ESI % 32 in EAX
    mov   [rdi + rdx*4], eax    ; store it back

    ; more unrolled iterations

    ; add   esi, r10d           ; ans += factor if we're doing scalar
    ...
    cmp/jb .loop
So given a bit-index in a GP register, that's 4 uops to set the bit in memory. Notice that the load and store are both done with mov, so indexed addressing modes have no penalty on Haswell and later.
But the best I could get compilers to make was 5, I think, using shlx / shr / or [mem], reg. (With an indexed addressing mode, the or is 3 uops instead of 2.)
I think if you're willing to use hand-written asm, you can go faster with this scalar and ditch SIMD entirely. Conflicts are never a correctness problem for this.
Maybe you can even get a compiler to emit something comparable, but even a single extra uop per unrolled RMW is a big deal.