What do you do without fast gather and scatter in AVX2 instructions?

耶瑟儿~ 2021-02-15 14:28

I'm writing a program to detect prime numbers. One part is bit sieving possible candidates out. I've written a fairly fast program but I thought I'd see if anyone has some […]

2 Answers

    南方客 · 2021-02-15 15:10

    I just looked at exactly what you're doing here: For the mod1 = mod3 = _mm256_set1_epi64x(1); case, you're just setting single bits in a bitmap with elements of ans as the index.

    And it's unrolled by two, with ans and ans2 running in parallel, using mod1 << ans and mod3 << ans2. Comment your code and explain what's going on in the big picture using English text! This is just a very complicated implementation of the bit-setting loop of a normal Sieve of Eratosthenes. (So it would have been nice if the question had said so in the first place.)

    Unrolling with multiple starts/strides in parallel is a very good optimization because you normally set multiple bits in a cache line while it's still hot in L1d. Cache-blocking for fewer different factors at once has similar advantages: iterate over the same 8kiB or 16kiB chunk of memory repeatedly for multiple factors (strides) before moving on to the next. Unrolling with 4 offsets for each of 2 different strides could be a good way to create more ILP.

    The more strides you run in parallel, the slower you go through new cache lines the first time you touch them, though. (Giving cache / TLB prefetch room to avoid even an initial stall). So cache blocking doesn't remove all the benefit of multiple strides.
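
    Something like this rough sketch is what I mean by cache-blocking (my own illustration, not your code; the Stride struct and sieve_blocked name are made up): sweep every stride over one 16kiB chunk of the bitmap before advancing to the next chunk.

    #include <stdint.h>
    #include <stddef.h>

    struct Stride { uint64_t next, step; };   // next bit-index to set, and the bit stride

    static void sieve_blocked(uint64_t *sieve, uint64_t total_bits,
                              Stride *strides, size_t nstrides)
    {
        const uint64_t BLOCK_BITS = 16 * 1024 * 8;        // 16kiB of bitmap per block
        for (uint64_t base = 0; base < total_bits; base += BLOCK_BITS) {
            uint64_t end = base + BLOCK_BITS < total_bits ? base + BLOCK_BITS : total_bits;
            for (size_t s = 0; s < nstrides; ++s) {       // every factor hits this block while it's hot
                uint64_t cur = strides[s].next;
                for (; cur < end; cur += strides[s].step)
                    sieve[cur >> 6] |= 1ULL << (cur & 63);
                strides[s].next = cur;                    // resume here in the next block
            }
        }
    }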


    Possible special case for strides < 256

    A single 256-bit vector load/VPOR/store can set multiple bits. The trick is creating a vector constant, or set of vector constants, with bits in the right position. The repeating pattern is something like LCM(256, bit_stride) bits long, though. For example stride=3 would repeat in a pattern that's 3 vectors long. This very quickly gets unusable for odd / prime strides unless there's something more clever :(
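
    For example, here's a rough sketch (mine, with a made-up or_stride3 name) for bit-stride 3: the pattern repeats every LCM(256, 3) = 768 bits = 3 vectors, so three precomputed mask vectors OR'd in rotation cover the whole bitmap. It assumes the first set bit is at bit 0; a real sieve would have to phase-shift the pattern for the actual starting offset.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    static void or_stride3(uint64_t *sieve64, size_t total_bits)
    {
        uint64_t pat[12] = {0};                      // 3 vectors * 4 qwords = 768 bits
        for (size_t bit = 0; bit < 768; bit += 3)    // build the repeating pattern once, scalar
            pat[bit >> 6] |= 1ULL << (bit & 63);
        __m256i m[3];
        for (int v = 0; v < 3; ++v)
            m[v] = _mm256_loadu_si256((const __m256i *)&pat[4 * v]);

        size_t nvec = total_bits / 256;
        for (size_t i = 0; i + 3 <= nvec; i += 3) {  // whole 768-bit repeats; tail left to scalar
            for (int v = 0; v < 3; ++v) {
                __m256i *p = (__m256i *)(sieve64 + 4 * (i + v));
                _mm256_storeu_si256(p, _mm256_or_si256(_mm256_loadu_si256(p), m[v]));
            }
        }
    }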

    64-bit scalar is interesting because bitwise rotate is available to create a sequence of patterns, but variable-count rotate on SnB-family CPUs costs 2 uops.

    There might be more you can do with this; maybe unaligned loads could help somehow.

    A repeating pattern of bitmasks could be useful even for the large-stride case, e.g. rotating by stride % 8 every time. But that would be more useful if you were JITing a loop that hard-coded the pattern into or byte [mem], imm8, with an unroll factor chosen to be congruent with repeat length.


    Conflict reduction with narrower loads/stores

    You don't have to load/modify/store 64-bit chunks when you're only setting a single bit. The narrower your RMW operations, the closer your bit-indices can be without conflicting.

    (But you don't have a long loop-carried dep chain on the same location; you will move on before OoO exec stalls waiting for a reload at the end of a long chain. So if conflicts aren't a correctness problem, it's unlikely to make a big perf difference here. Unlike a bitmap histogram or something, where a long chain of repeated hits on nearby bits could easily happen.)

    32-bit elements would be an obvious choice. x86 can efficiently load/store dwords to/from SIMD registers as well as scalar. (scalar byte ops are efficient, too, but byte stores from SIMD regs always require multiple uops with pextrb.)

    If you're not gathering into SIMD registers, the SIMD element width for ans / ans2 doesn't have to match the RMW width. 32-bit RMW has advantages over 8-bit if you want to split a bit-index into address / bit-offset in scalar, using shifts or bts that implicitly mask the shift count to 32 bits (or 64 bits for 64-bit shifts). But 8-bit shlx or bts doesn't exist.

    The main advantage of using 64-bit SIMD elements is if you're calculating a pointer instead of just an index. If you could restrict your sieveX to 32 bits you'd still be able to do this. e.g. allocate with mmap(..., MAP_32BIT|MAP_ANONYMOUS, ...) on Linux. That's assuming you don't need more than 2^32 bits (512MiB) sieve space, so your bit indices always fit in 32-bit elements. If that's not the case, you could still use 32-bit element vectors up to that point, then use your current loop for the high part.
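
    For example, something like this (x86-64 Linux only; alloc_low_sieve is a made-up name) keeps the whole bitmap below 2^32, so 32-bit elements can hold full pointers or bit-indices:

    #define _GNU_SOURCE              // for MAP_32BIT / MAP_ANONYMOUS
    #include <sys/mman.h>
    #include <stdint.h>
    #include <stddef.h>

    static uint64_t *alloc_low_sieve(size_t bits)
    {
        size_t bytes = (bits + 7) / 8;
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        return (p == MAP_FAILED) ? NULL : (uint64_t *)p;
    }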

    If you use 32-bit SIMD elements without restricting sieveX to be a 32-bit pointer, you'd have to give up on using SIMD pointer calculations and just extract a bit-index, or still split in SIMD into idx/bit and extract both.

    (With 32-bit elements, a SIMD -> scalar strategy based on store/reload looks even more attractive, but in C that's mostly up to your compiler.)
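
    Something like this sketch (made-up helper name) is what I mean by store/reload: dump eight 32-bit bit-indices to a small buffer and do the RMW ORs with scalar code.

    #include <immintrin.h>
    #include <stdint.h>

    // Hypothetical helper: 8 bit-indices in a __m256i -> scalar |= into a dword bitmap.
    static inline void set_bits_from_vec(uint32_t *sieve, __m256i bit_indices)
    {
        alignas(32) uint32_t idx[8];
        _mm256_store_si256((__m256i *)idx, bit_indices);
        for (int i = 0; i < 8; ++i)
            sieve[idx[i] >> 5] |= 1u << (idx[i] & 31);
    }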

    If you were manually gathering into 32-bit elements, you couldn't use movhps anymore. You'd have to use pinsrd / pextrd for the high 3 elements, and those never micro-fuse / always need a port5 uop on SnB-family. (Unlike movhps which is a pure store). But that means vpinsrd is still 2 uops with an indexed addressing mode. You could still use vmovhps for element 2 (then overwrite the top dword of the vector with vpinsrd); unaligned loads are cheap and it's ok to overlap the next element. But you can't do movhps stores, and that's where it was really good.


    There are two big performance problems with your current strategy:

    Apparently you're sometimes using this with some elements of mod1 or mod3 being 0, resulting in completely useless wasted work, doing [mem] |= 0 for those strides.

    I think once an element in ans or ans2 reaches total, you're going to fall out of the inner loop and do ans -= sum 1 every time through the inner loop. You don't necessarily want to reset it back ans = sum (for that element) to redo the ORing (setting bits that were already set), because that memory will be cold in cache. What we really want is to pack the remaining still-in-use elements into known locations and enter other versions of the loop that only do 7, then 6, then 5 total elements. Then we're down to only 1 vector.

    That seems really clunky. A better strategy for one element hitting the end might be to finish the remaining three in that vector with scalar, one at a time, then run the remaining single __m256i vector. If the strides are all nearby, you probably get good cache locality.
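
    A rough sketch of that fallback (my made-up names, and it simply runs every lane of the finished vector as a plain scalar strided loop; lanes already past total just skip):

    #include <immintrin.h>
    #include <stdint.h>

    static void finish_vector_scalar(uint64_t *sieve, __m256i ans, __m256i step, uint64_t total)
    {
        alignas(32) uint64_t cur[4], stride[4];
        _mm256_store_si256((__m256i *)cur, ans);
        _mm256_store_si256((__m256i *)stride, step);
        for (int lane = 0; lane < 4; ++lane)
            for (uint64_t i = cur[lane]; i < total; i += stride[lane])
                sieve[i >> 6] |= 1ULL << (i & 63);
    }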


    Cheaper scalar, or maybe still SIMD but extract only a bit-index

    Splitting the bit-index into a qword index and a bitmask with SIMD and then extracting both separately costs a lot of uops for the scalar-OR case: so many that you're not bottlenecking on 1-per-clock store throughput, even with all the optimizations in my scatter/gather answer. (Cache misses may slow this down sometimes, but fewer front-end uops means a larger out-of-order window to find parallelism and keep more memory ops in flight.)

    If we can get the compiler to make good scalar code to split a bit-index, we could consider pure scalar. Or at least extracting only bit-indices and skipping the SIMD shift/mask stuff.

    It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! like 10 uops on Skylake.)

    Pure scalar may not be ideal, though. I was playing around with this on Godbolt:

    #include <immintrin.h>    // only needed for the USE_AVX2 path
    #include <stdint.h>
    #include <utility>        // std::swap
    
    // Sieve the bits in array sieveX for later use
    void sieveFactors(uint64_t *sieveX64, unsigned cur1, unsigned cur2, unsigned factor1, unsigned factor2)
    {
        const uint64_t totalX = 5000000;
    #ifdef USE_AVX2
    //...
    #else
         //uint64_t cur = 58;
         //uint64_t cur2 = 142;
         //uint64_t factor = 67;
         uint32_t *sieveX = (uint32_t*)sieveX64;
    
        if (cur1 > cur2) {
            // TODO: if factors can be different, properly check which will end first
            std::swap(cur1, cur2);
            std::swap(factor1, factor2);
        }
        // factor1 = factor2;  // is this always true?
    
        while (cur2 < totalX) {
             sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
             sieveX[cur2 >> 5] |= (1U << (cur2 & 0x1f));
             cur1 += factor1;
             cur2 += factor2;
        }
        while (cur1 < totalX) {
             sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
             cur1 += factor1;
        }
    #endif
    }
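
    A hypothetical driver (mine, matching the commented-out example values above) would allocate a bitmap of totalX bits and pass the two start offsets plus the shared factor:

    #include <vector>

    int main()
    {
        std::vector<uint64_t> sieve((5000000 + 63) / 64, 0);
        sieveFactors(sieve.data(), 58, 142, 67, 67);
    }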
    

    Note how I replaced your outer if(), which chose between the two loops, with sorting cur1, cur2 instead.

    GCC and clang put a 1 in a register outside the loop, and use shlx r9d, ecx, esi inside the loop to do 1U << (cur1 & 0x1f) in a single uop without destroying the 1. (MSVC uses load / BTS / store, but it's clunky with a lot of mov instructions. I don't know how to tell MSVC it's allowed to use BMI2.)

    If an indexed addressing mode for or [mem], reg didn't cost an extra uop, this would be great.

    The problem is that you need a shr reg, 5 in there somewhere, and that's destructive. Putting 5 in a register and using that to copy+shift the bit-index would be an ideal setup for load / BTS / store, but it seems compilers don't know that optimization.

    Optimal(?) scalar split and use of a bit-index

       mov   ecx, 5    ; outside the loop
    
    .loop:
        ; ESI is the bit-index.
        ; Could be pure scalar, or could come from an extract of ans directly
    
        shrx  edx, esi, ecx           ; EDX = ESI>>5 = dword index
        mov   eax, [rdi + rdx*4]
        bts   eax, esi                ; set esi % 32 in EAX
        mov   [rdi + rdx*4], eax      ; store the updated dword back
    
    
        ; ... more unrolled iterations ...
    
        ; add   esi, r10d               ; ans += factor if we're doing scalar
    
        ...
        cmp/jb .loop
    

    So given a bit-index in a GP register, that's 4 uops to set the bit in memory. Notice that the load and store are both with mov, so indexed addressing modes have no penalty on Haswell and later.

    But the best I could get compilers to make was 5, I think, using shlx / shr / or [mem], reg. (With an indexed addressing mode the or is 3 uops instead of 2.)

    I think if you're willing to use hand-written asm, you can go faster with this scalar and ditch SIMD entirely. Conflicts are never a correctness problem for this.

    Maybe you can even get a compiler to emit something comparable, but even a single extra uop per unrolled RMW is a big deal.
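
    One hedged way to nudge GCC/clang toward the load / bts / store shape (my sketch, not benchmarked; it assumes AT&T inline-asm syntax, and the made-up set_bit_bts is not part of your code) is to wrap just the bts in inline asm so the compiler can't rewrite it back into shift/or:

    #include <stdint.h>

    // Hypothetical helper: set bit `bitidx` in a dword bitmap, mirroring the
    // hand-written loop body above. bts with a register destination only looks
    // at bitidx % 32, so no separate mask of the low bits is needed.
    static inline void set_bit_bts(uint32_t *sieve, uint32_t bitidx)
    {
        uint32_t dword = sieve[bitidx >> 5];                        // mov eax, [rdi + rdx*4]
        asm ("bts %1, %0" : "+r"(dword) : "r"(bitidx) : "cc");      // bts eax, esi
        sieve[bitidx >> 5] = dword;                                 // mov [rdi + rdx*4], eax
    }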
