selectively xor-ing elements of a list with AVX2 instructions

后端 未结 2 615
被撕碎了的回忆
被撕碎了的回忆 2021-01-21 04:20

I want to speed up the following operation with AVX2 instructions, but I was not able to find a way to do so.

I am given a large array uint64_t data[100000]

相关标签:
2条回答
  • 2021-01-21 04:33

    You can sort the data according to indices[i]... This should take O(N*log2(N)), but that can be parallelized.

    Then taking the cumulative xor of the sorted data -- which can be also parallelized.

    Then it's the matter of calculating Out[i] = CumXor(j) ^ Out[i-1];

    0 讨论(0)
  • 2021-01-21 04:39

    Based on static analysis for Haswell/Skylake, I came up with a version that runs in ~5 cycles per 4 i values, instead of 8 cycles, when compiled by gcc. Average for large sizes, not including the time to combine multiple copies of Out[], and assuming a random distribution of Indices that doesn't lead to any store/reload dep chains running for long enough to matter.

    IDK if you care about Ryzen or Excavator (the other 2 mainstream AVX2 microachitectures).

    I haven't done a careful analysis by hand, but IACA is wrong for HSW/SKL and thinks some instructions don't micro-fuse when in fact they do (tested on an i7-6700k with perf counters), so it thinks the front-end bottleneck is more severe than it really is. e.g. movhps load+merge micro-fuses, but IACA thinks it doesn't even with simple addressing modes.

    We should have negligible any cache misses, because uint64_t Out[4][256] is only 8kiB. So our cache footprint is only 1/4 of L1d size on most recent CPUs, and should be mostly fine even with hyperthreading sharing L1d between two logical threads. Looping over data[] and Indices[] should prefetch well, and hopefully doesn't evict Out[] much. Thus, static analysis has a good chance of being somewhat accurate, and it's quicker than careful micro-benchmarking and more importantly tells you exactly what the bottlenecks are.

    But of course we're relying heavily on out-of-order execution and imperfect scheduling or other unexpected bottlenecks could easily happen. I didn't feel like actually microbenchmarking if I'm not getting paid, though.

    Can we implement this more efficiently with AVX2 instructions?

    This is basically a histogram problem. The usual histogram optimization of using multiple tables and combining at the end applies. SIMD XOR is useful for the combine-at-the-end (as long as you use Out[4][256], not Out[256][4]. The latter also makes the indexing slower by requiring scaling by 8*4 instead of 8 (which can be done with a single LEA in a scaled-index addressing mode)).

    But unlike a normal histogram, you're XORing in some data from memory instead of ADDing a constant 1. So instead of an immediate 1, the code has to load data[i] into a register as a source for xor. (Or load, then xor reg, data[i] / store). This is even more total memory operations than a histogram.

    We come out ahead from "manual" gather/scatter into SIMD vectors (using movq / movhps loads/stores), allowing us to use SIMD for the data[i] load and XOR. This reduces the total number of load operations, and thus reduces load-port pressure without costing extra front-end bandwidth.

    Manual gather into 256-bit vectors probably is probably not worth the extra shuffling (an extra vinserti128 / vextracti128 just so we can combine 2 memory-source vpxors into one 256-bit one). 128-bit vectors should be good. Front-end throughput is also a major issue, because (on Intel SnB-family CPUs) you want to avoid indexed addressing modes for the stores. gcc uses lea instructions to calculate addresses in registers, instead of using indexed loads/stores. clang / LLVM with -march=skylake decides not to, which is a bad decision in this case because the loop bottlenecks on port 2 / port 3, and spending extra ALU uops to allow stores-address uops to use port 7 is a win. But if you're not bottlenecked on p23, spending extra uops to avoid indexed stores is not good. (And in cases where the can stay micro-fused, definitely not just to avoid indexed loads; silly gcc). Maybe gcc and LLVM's addressing-mode cost models aren't very accurate, or they don't model the pipeline in enough detail to figure out when a loop bottlenecks on the front-end vs. a specific port.

    Choice of addressing-modes and other asm code-gen choices are critical for this to perform optimally on SnB-family. But writing in C gives you no control over that; you're mostly at the mercy of the compiler, unless you can tweak the source to get it to make a different choice. e.g. gcc vs. clang makes a significant difference here.

    On SnB-family, a movhps load needs port 5 for the shuffle/blend (although it does micro-fuse into one uop), but a movhps store is a pure store with no ALU uop. So it's break-even there, and lets us use one SIMD load / XOR for two data elements.

    With AVX, unaligned memory source operands are allowed for ALU uops, so we don't need to require alignment for data[]. But Intel HSW/SKL can keep an indexed addressing mode micro-fused with pxor but not vpxor. So compiling without AVX enabled can be better, allowing the compiler to use an indexed addressing mode instead of incrementing a separate pointer. (Or making it faster if the compiler doesn't know this and uses an indexed addressing mode anyway.) TL:DR: probably best to require 16-byte aligned data[] and compile that function with AVX disabled, for better macro-fusion. (But then we miss out on 256-bit SIMD for combining the Out slices at the end, unless we put that in a different function compiled with AVX or AVX2)

    Avoiding unaligned loads will avoid any cache-line splits, too, which doesn't cost extra uops but we're probably close to bottlenecking on L1d throughput limits, not just load/store execution unit throughput limits.


    I also looked at loading 4 indices at once and unpacking with ALU instructions. e.g. with memcpy into struct { uint8_t idx[4]; } idx;. But gcc generates multiple wasted instructions for unpacking that. Too bad x86 doesn't have great bitfield instructions like ARM ubfx or especially PowerPC rlwinm (which could leave the result left-shifted for free, so if x86 had that, a static Out could have used a base+disp32 addressing mode in non-PIC code.)

    Unpacking a dword with shift / movzx from AL/AH is a win if we're using scalar XOR, but it looks like it's not when we're using SIMD for data[] and spending front-end throughput on lea instructions to allow store-address uops to run on port 7. That makes us front-end bottlenecked instead of port2/3 bottlenecked, so using 4x movzx loads from memory looks best according to static analysis. Would be worth benchmarking both ways if you take the time to hand-edit the asm. (The gcc-generated asm with extra uops is just bad, including a completely redundant movzx after right-shifting by 24, leaving the upper bits already zero.)


    The code

    (See it on the Godbolt compiler explorer, along with a scalar version):

    #include <immintrin.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdalign.h>
    
    #ifdef IACA_MARKS
    #include "/opt/iaca-3.0/iacaMarks.h"
    #else
    #define IACA_START
    #define IACA_END
    #endif
    
    void hist_gatherscatter(unsigned idx0, unsigned idx1,
                           uint64_t Out0[256], uint64_t Out1[256],
                           __m128i vdata) {
        // gather load from Out[0][?] and Out[1][?] with movq / movhps
        __m128i hist = _mm_loadl_epi64((__m128i*)&Out0[idx0]);
        hist = _mm_castps_si128(   // movhps into the high half
                   _mm_loadh_pi(_mm_castsi128_ps(hist), (__m64*)&Out1[idx1]));
    
        // xorps could bottleneck on port5.
        // Actually probably not, using __m128 the whole time would be simpler and maybe not confuse clang
        hist = _mm_xor_si128(hist, vdata);
    
        // scatter store with movq / movhps
        _mm_storel_epi64((__m128i*)&Out0[idx0], hist);
        _mm_storeh_pi((__m64*)&Out1[idx1], _mm_castsi128_ps(hist));
    }
    
    void ext(uint64_t*);
    
    void xor_histo_avx(uint8_t *Indices, const uint64_t *data, size_t len)
    {
        alignas(32) uint64_t Out[4][256] = {{0}};
    
        // optional: peel the first iteration and optimize away loading the old known-zero values from Out[0..3][Indices[0..3]].
    
        if (len<3)   // not shown: cleanup for last up-to-3 elements.
            return;
    
        for (size_t i = 0 ; i<len ; i+=4) {
            IACA_START
            // attempt to hand-hold compiler into a dword load + shifts to extract indices
            // to reduce load-port pressure
            struct { uint8_t idx[4]; } idx;
    #if 0
            memcpy(&idx, Indices+i, sizeof(idx));  // safe with strict-aliasing and possibly-unaligned
       //gcc makes stupid asm for this, same as for memcpy into a struct,
       // using a dword load into EAX (good),
       // then AL/AH for the first 2 (good)
       // but then redundant mov and movzx instructions for the high 2
    
       // clang turns it into 4 loads
    
    /*
         //Attempt to hand-hold gcc into less-stupid asm
         //doesn't work: same asm as the struct
            uint32_t tmp;
            memcpy(&tmp, Indices+i, sizeof(tmp));  // mov eax,[mem]
            idx.idx[0] = tmp;     //movzx reg, AL
            idx.idx[1] = tmp>>8;  //movzx reg, AH
            tmp >>= 16;           //shr   eax, 16
            idx.idx[2] = tmp;     //movzx reg, AL
            idx.idx[3] = tmp>>8;  //movzx reg, AH
    */
    #else
           // compiles to separate loads with gcc and clang
            idx.idx[0] = Indices[i+0];
            idx.idx[1] = Indices[i+1];
            idx.idx[2] = Indices[i+2];
            idx.idx[3] = Indices[i+3];
    #endif
    
            __m128i vd = _mm_load_si128((const __m128i*)&data[i]);
            hist_gatherscatter(idx.idx[0], idx.idx[1], Out[0], Out[1], vd);
    
            vd = _mm_load_si128((const __m128i*)&data[i+2]);
            hist_gatherscatter(idx.idx[2], idx.idx[3], Out[2], Out[3], vd);
        }
        IACA_END
    
    
       // hand-hold compilers into a pointer-increment loop
       // to avoid indexed addressing modes.  (4/5 speedup on HSW/SKL if all the stores use port7)
        __m256i *outp = (__m256i*)&Out[0];
        __m256i *endp = (__m256i*)&Out[3][256];
        for (; outp < endp ; outp++) {
            outp[0] ^= outp[256/4*1];
            outp[0] ^= outp[256/4*2];
            outp[0] ^= outp[256/4*3];
        }
        // This part compiles horribly with -mno-avx, but does compile
        // because I used GNU C native vector operators on __m256i instead of intrinsics.
    
    /*
        for (int i=0 ; i<256 ; i+=4) {
            // use loadu / storeu if Out isn't aligned
            __m256i out0 = _mm256_load_si256(&Out[0][i]);
            __m256i out1 = _mm256_load_si256(&Out[1][i]);
            __m256i out2 = _mm256_load_si256(&Out[2][i]);
            __m256i out3 = _mm256_load_si256(&Out[3][i]);
            out0 = _mm256_xor_si256(out0, out1);
            out0 = _mm256_xor_si256(out0, out2);
            out0 = _mm256_xor_si256(out0, out3);
            _mm256_store_si256(&Out[0][i], out0);
        }
    */
    
        //ext(Out[0]);  // prevent optimizing away the work
        asm("" :: "r"(Out) : "memory");
    }
    

    Compiled with gcc7.3 -std=gnu11 -DIACA_MARKS -O3 -march=skylake -mno-avx, and analyzed with IACA-3.0:

    $ /opt/iaca-3.0/iaca xor-histo.iaca.o                                                                             Intel(R) Architecture Code Analyzer Version -  v3.0-28-g1ba2cbb build date: 2017-10-23;16:42:45
    Analyzed File -  xor-histo.iaca.o
    Binary Format - 64Bit
    Architecture  -  SKL
    Analysis Type - Throughput
    
    Throughput Analysis Report
    --------------------------
    Block Throughput: 5.79 Cycles       Throughput Bottleneck: FrontEnd
    Loop Count:  22 (this is fused-domain uops.  It's actually 20, so a 5 cycle front-end bottleneck)
    Port Binding In Cycles Per Iteration:
    --------------------------------------------------------------------------------------------------
    |  Port  |   0   -  DV   |   1   |   2   -  D    |   3   -  D    |   4   |   5   |   6   |   7   |
    --------------------------------------------------------------------------------------------------
    | Cycles |  2.0     0.0  |  3.0  |  5.5     5.1  |  5.5     4.9  |  4.0  |  3.0  |  2.0  |  3.0  |
    --------------------------------------------------------------------------------------------------
    
    DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3)
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion occurred
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
    X - instruction not supported, was not accounted in Analysis
    
    | Num Of   |                    Ports pressure in cycles                         |      |
    |  Uops    |  0  - DV    |  1   |  2  -  D    |  3  -  D    |  4   |  5   |  6   |  7   |
    -----------------------------------------------------------------------------------------
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi]
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx edx, byte ptr [rdi+0x2]
    |   1      |             |      |             |             |      |      | 1.0  |      | add rdi, 0x4
    |   1      |             |      |             |             |      |      | 1.0  |      | add rsi, 0x20
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx eax, byte ptr [rdi-0x1]
    |   1      |             | 1.0  |             |             |      |      |      |      | lea r12, ptr [rcx+r8*8]
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movzx r8d, byte ptr [rdi-0x3]
    |   1      |             | 1.0  |             |             |      |      |      |      | lea rdx, ptr [r10+rdx*8]
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [r12]
    |   1      |             |      |             |             |      | 1.0  |      |      | lea rax, ptr [r9+rax*8]
    |   1      |             | 1.0  |             |             |      |      |      |      | lea r8, ptr [r11+r8*8]
    |   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [r8]   # Wrong, 1 micro-fused uop on SKL
    |   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x20]
    |   2^     |             |      | 0.5         | 0.5         | 1.0  |      |      |      | movq qword ptr [r12], xmm0   # can run on port 7, IDK why IACA chooses not to model it there
    |   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [r8], xmm0
    |   1      |             |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | movq xmm0, qword ptr [rdx]
    |   2      |             |      | 0.5     0.5 | 0.5     0.5 |      | 1.0  |      |      | movhps xmm0, qword ptr [rax]  # Wrong, 1 micro-fused uop on SKL
    |   2^     | 1.0         |      | 0.5     0.5 | 0.5     0.5 |      |      |      |      | pxor xmm0, xmmword ptr [rsi-0x10]
    |   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movq qword ptr [rdx], xmm0
    |   2^     |             |      |             |             | 1.0  |      |      | 1.0  | movhps qword ptr [rax], xmm0
    |   1*     |             |      |             |             |      |      |      |      | cmp rbx, rdi
    |   0*F    |             |      |             |             |      |      |      |      | jnz 0xffffffffffffffa0
    Total Num Of Uops: 29  (This is unfused-domain, and a weird thing to total up).
    

    gcc8.1 on Godbolt uses a scaled-index addressing mode for pxor, using the same counter for Indices and data[], so that saves an add.

    clang doesn't use LEA, and bottlenecks at 4 is per 7 cycles, because none of the store uops can run on port 7.

    The scalar version (still using 4 slices of Out[4][256]):

    $ iaca.sh -mark 2 xor-histo.iaca.o                               
    Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
    Analyzed File - xor-histo.iaca.o
    Binary Format - 64Bit
    Architecture  - SKL
    Analysis Type - Throughput
    
    *******************************************************************
    Intel(R) Architecture Code Analyzer Mark Number 2
    *******************************************************************
    
    Throughput Analysis Report
    --------------------------
    Block Throughput: 7.24 Cycles       Throughput Bottleneck: FrontEnd
    
    Port Binding In Cycles Per Iteration:
    ---------------------------------------------------------------------------------------
    |  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
    ---------------------------------------------------------------------------------------
    | Cycles | 3.0    0.0  | 3.0  | 6.2    4.5  | 6.8    4.5  | 4.0  | 3.0  | 3.0  | 0.0  |
    ---------------------------------------------------------------------------------------
    
    N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
    D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
    F - Macro Fusion with the previous instruction occurred
    * - instruction micro-ops not bound to a port
    ^ - Micro Fusion happened
    # - ESP Tracking sync uop was issued
    @ - SSE instruction followed an AVX256/AVX512 instruction, dozens of cycles penalty is expected
    X - instruction not supported, was not accounted in Analysis
    
    | Num Of |                    Ports pressure in cycles                     |    |
    |  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
    ---------------------------------------------------------------------------------
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov eax, dword ptr [rdi]
    |   1    | 0.4       | 0.5 |           |           |     | 0.1 |     |     |    | add rdi, 0x4
    |   1    |           | 0.7 |           |           |     | 0.3 |     |     |    | add rsi, 0x20
    |   1*   |           |     |           |           |     |     |     |     |    | movzx r9d, al
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+r9*8-0x2040]
    |   2^   |           | 0.3 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x20]
    |   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+r9*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
    |   1*   |           |     |           |           |     |     |     |     |    | movzx edx, ah
    |   1    |           |     |           |           |     | 0.4 | 0.6 |     |    | add rdx, 0x100
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
    |   2^   | 0.6       | 0.2 | 0.5   0.5 | 0.5   0.5 |     | 0.2 | 0.1 |     |    | xor r9, qword ptr [rsi-0x18]
    |   2    |           |     | 0.2       | 0.8       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
    |   1*   |           |     |           |           |     |     |     |     |    | mov edx, eax   # gcc code-gen isn't great, but not as bad as in the SIMD loop.  No extra movzx, but not taking advantage of AL/AH
    |   1    | 0.4       |     |           |           |     |     | 0.6 |     |    | shr eax, 0x18
    |   1    | 0.8       |     |           |           |     |     | 0.2 |     |    | shr edx, 0x10
    |   1    |           | 0.6 |           |           |     | 0.3 |     |     |    | add rax, 0x300
    |   1*   |           |     |           |           |     |     |     |     |    | movzx edx, dl
    |   1    | 0.2       | 0.1 |           |           |     | 0.5 | 0.2 |     |    | add rdx, 0x200
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov r9, qword ptr [rbp+rdx*8-0x2040]
    |   2^   |           | 0.6 | 0.5   0.5 | 0.5   0.5 |     | 0.3 | 0.1 |     |    | xor r9, qword ptr [rsi-0x10]
    |   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rdx*8-0x2040], r9  # wrong, HSW/SKL can keep indexed stores fused
    |   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     |     |     |    | mov rdx, qword ptr [rbp+rax*8-0x2040]
    |   2^   |           |     | 0.5   0.5 | 0.5   0.5 |     | 0.6 | 0.4 |     |    | xor rdx, qword ptr [rsi-0x8]
    |   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | mov qword ptr [rbp+rax*8-0x2040], rdx  # wrong, HSW/SKL can keep indexed stores fused
    |   1    | 0.6       |     |           |           |     |     | 0.4 |     |    | cmp r8, rdi
    |   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffff75
    Total Num Of Uops: 33
    

    The loop is 4 fused-domain uops shorter than what IACA counts, because it doesn't know that only SnB/IvB un-laminate indexed stores. HSW/SKL don't. Such stores still can't use port 7, though, so this won't get any better than ~6.5 cycles for 4 elements.

    (And BTW, with naive handling of Indices[i], loading each one separately with movzx, you get 8 cycles for 4 elements, saturating ports 2 and 3. Even though gcc doesn't generate throughput-optimal code for unpacking the struct, the 4-byte load + unpack should be a net win by relieving some load-port pressure.)


    Cleanup loop:

    AVX2 really shines here: we loop over the lowest slice of the histogram, and XOR in the other slices. This loop is 8 front-end uops with 4 loads on Skylake, and should run at 1 iter per 2 clocks:

    .L7:
        vmovdqa ymm2, YMMWORD PTR [rax+4096]
        vpxor   ymm0, ymm2, YMMWORD PTR [rax+6144]
        vmovdqa ymm3, YMMWORD PTR [rax]
        vpxor   ymm1, ymm3, YMMWORD PTR [rax+2048]
        vpxor   ymm0, ymm0, ymm1
        vmovdqa YMMWORD PTR [rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L7
    

    I tried to reduce the uop count further by doing the XORs in one chain, but gcc insists on doing two vmovdqa loads and having to do one vpxor without a memory operand. (OoO exec will hide the latency of this tiny chain / tree of VPXOR so it doesn't matter.)


    How would I use the scattering with AVX-512? Is there some scatters instruction that xors instead of overwrites?

    No, you'd use a gather to get the old values, then SIMD XOR, then scatter the updated elements back to the locations they came from.

    To avoid conflicts, you might want out[8][256] so every vector element can use a different table. (Otherwise you have a problem if Indices[i+0] and Indices[i+4] were equal, because the scatter store would just store the highest vector element with that index.

    Scatter/gather instructions need a single base register, but you can simply add _mm256_setr_epi64(0, 256, 256*2, ...); after doing a vpmovzxbq zero-extending load.


    Notes

    I used IACA2.3 for the scalar analysis because IACA3.0 seems to have removed the -mark option to choose which loop to analyze when you have multiple marks in one file. IACA3.0 didn't fix any of the ways that IACA2.3 is wrong about SKL's pipeline in this case.

    0 讨论(0)
提交回复
热议问题