Using SIMD/AVX/SSE for tree traversal

感动是毒 2020-12-25 10:23

I am currently researching whether it would be possible to speed up traversal of a van Emde Boas (or any other) tree. Given a single search query as input, already having multip

2 answers
  • 2020-12-25 10:56

    Based on your code, I went ahead and benchmarked three options: AVX2-powered, nested branching (4 jumps), and a branchless variant. These are the results:

    // Performance table. All runs use 64-byte-aligned, cache-line-sized chunks (van Emde Boas layout),
    // with the loop unrolled per cache line and all optimizations turned on.
    // Each element is 4 bytes. Intel i7-4770K (Haswell) @ 3.50 GHz.

    Type        ElementAmount       LoopCount       Avg. Cycles / Query
    ===================================================================
    AVX2        210485750           100000000       610 cycles    
    AVX2        21048575            100000000       427 cycles           
    AVX2        2104857             100000000       288 cycles 
    AVX2        210485              100000000       157 cycles   
    AVX2        21048               100000000       95 cycles  
    AVX2        2104                100000000       49 cycles    
    AVX2        210                 100000000       17 cycles 
    AVX2        100                 100000000       16 cycles   
    
    
    Type        ElementAmount       LoopCount       Avg. Cycles / Query
    ===================================================================  
    Branching   210485750           100000000       819 cycles 
    Branching   21048575            100000000       594 cycles 
    Branching   2104857             100000000       358 cycles 
    Branching   210485              100000000       165 cycles 
    Branching   21048               100000000       82 cycles
    Branching   2104                100000000       49 cycles 
    Branching   210                 100000000       21 cycles 
    Branching   100                 100000000       16 cycles   
    
    
    Type        ElementAmount       LoopCount       Avg. Cycles / Query
    =================================================================== 
    BranchLESS  210485750           100000000       675 cycles 
    BranchLESS  21048575            100000000       602 cycles 
    BranchLESS  2104857             100000000       417 cycles
    BranchLESS  210485              100000000       273 cycles 
    BranchLESS  21048               100000000       130 cycles 
    BranchLESS  2104                100000000       72 cycles 
    BranchLESS  210                 100000000       27 cycles 
    BranchLESS  100                 100000000       18 cycles
    

    So my conclusion is: when memory access is close to optimal, AVX2 helps for trees larger than about 200k elements. Below that there is hardly any penalty to pay (if you don't use AVX for anything else). It was worth the night of benchmarking. Thanks to everybody involved :-)
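
    The scalar variants that were benchmarked are not shown here. As a rough illustration of the difference between a branching and a branchless in-node search (a sketch only, not the benchmarked code; both function names are hypothetical), the two styles could look like this:

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Hypothetical helpers for illustration only.
    // Both return the index of the first key greater than `value`
    // in a sorted block of `n` keys, or `n` if there is no such key.

    // Branching: exits early, with one data-dependent jump per key inspected.
    std::size_t upper_index_branching(const std::int32_t* keys, std::size_t n,
                                      std::int32_t value)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (keys[i] > value) return i;
        }
        return n;
    }

    // Branchless: counts the keys <= value. For sorted keys that count
    // equals the index of the first key greater than value.
    std::size_t upper_index_branchless(const std::int32_t* keys, std::size_t n,
                                       std::int32_t value)
    {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i) {
            count += static_cast<std::size_t>(keys[i] <= value);
        }
        return count;
    }
    ```

    The branchless form always touches all keys in the block, which may be one reason such a variant costs more cycles on small trees where branch prediction works well.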

  • 2020-12-25 11:03

    I've used SSE2/AVX2 to help perform a B+tree search. Here's code to search a full cache line of 16 DWORDs with AVX2 (it compares all 16 keys at once rather than bisecting):

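    // build with AVX2 and BMI1 enabled (e.g. -mavx2 -mbmi on GCC/Clang)
    #include <immintrin.h> // AVX2 intrinsics and _tzcnt_u32
    #include <stdint.h>

    // perf-critical: ensure this is 64-byte aligned. (a full cache line)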
    union bnode
    {
        int32_t i32[16];
        __m256i m256[2];
    };
    
    // returns from 0 (if value < i32[0]) to 16 (if value >= i32[15]) 
    unsigned bsearch_avx2(bnode const* const node, __m256i const value)
    {
        __m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);
    
        // compare the two halves of the cache line.
    
        __m256i cmp1 = _mm256_load_si256(&node->m256[0]);
        __m256i cmp2 = _mm256_load_si256(&node->m256[1]);
    
        cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
        cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD
    
        // merge the comparisons back together.
        //
        // a permute is required to get the pack results back into order
        // because AVX-256 introduced that unfortunate two-lane interleave.
        //
        // alternately, you could pre-process your data to remove the need
        // for the permute.
    
        __m256i cmp = _mm256_packs_epi32(cmp1, cmp2); // PACKSSDW
        cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD
    
        // finally create a move mask and count trailing
        // zeroes to get an index to the next node.
    
        unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
        return _tzcnt_u32(mask) / 2; // TZCNT
    }
    

    You'll end up with a single highly predictable branch per bnode, to test if the end of the tree has been reached.
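
    The per-node branch mentioned above can be sketched as a descent loop. This is a minimal illustration under stated assumptions, not the code used in practice: `knode`, `bsearch_scalar` and `descend` are hypothetical names, the tree is assumed to be an implicit 17-ary tree stored breadth-first (child c of node n at index n * 17 + 1 + c), and a portable scalar search with the same return contract stands in for bsearch_avx2.

    ```cpp
    #include <cstddef>
    #include <cstdint>

    // Hypothetical node type: 16 sorted keys, mirroring bnode::i32.
    struct knode { std::int32_t i32[16]; };

    // Scalar stand-in with the same return contract as bsearch_avx2:
    // index of the first key greater than value, or 16 if there is none.
    unsigned bsearch_scalar(const knode& node, std::int32_t value)
    {
        unsigned idx = 0;
        for (int i = 0; i < 16; ++i) idx += (node.i32[i] <= value);
        return idx;
    }

    // Assumed layout: implicit 17-ary tree, breadth-first in one array.
    // Descends until the child index falls outside the array and returns
    // that index, which identifies the leaf slot the query landed in.
    std::size_t descend(const knode* nodes, std::size_t node_count,
                        std::int32_t value)
    {
        std::size_t n = 0;
        while (n < node_count) { // the single, highly predictable branch per node
            n = n * 17 + 1 + bsearch_scalar(nodes[n], value);
        }
        return n;
    }
    ```

    With the AVX2 version, the query would be broadcast once before the loop (for example with `_mm256_set1_epi32`) and the nodes would have to be 64-byte aligned.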

    This should be trivially scalable to AVX-512.

    To get rid of that slow PERMD instruction, preprocess each node once with the following, then drop the `_mm256_permutevar8x32_epi32` call from bsearch_avx2:

    void preprocess_avx2(bnode* const node)
    {
        __m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
        __m256i *const middle = (__m256i*)&node->i32[4];
    
        __m256i x = _mm256_loadu_si256(middle);
        x = _mm256_permutevar8x32_epi32(x, perm_mask);
        _mm256_storeu_si256(middle, x);
    }
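
    In scalar terms, the permute above swaps keys 4..7 with keys 8..11. A small sketch of that equivalent (the name `preprocess_scalar` is hypothetical and this is an illustration, not a replacement for the AVX2 version):

    ```cpp
    #include <algorithm>
    #include <cstdint>

    // Scalar equivalent of the lane swap done by preprocess_avx2:
    // exchanges keys[4..7] with keys[8..11], leaving the rest in place.
    void preprocess_scalar(std::int32_t (&keys)[16])
    {
        std::swap_ranges(keys + 4, keys + 8, keys + 8);
    }
    ```

    After the swap, the first 256-bit half holds keys 0..3 and 8..11, and the second holds keys 4..7 and 12..15. Because PACKSSDW packs each 128-bit lane separately, the packed comparison result then comes out in key order 0..15 without any further shuffling.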
    