I am currently researching whether it is possible to speed up a van Emde Boas tree traversal (or any tree traversal, really). Given a single search query as input, already having multip
Based on your code, I went ahead and benchmarked three options: an AVX2-powered search, nested branching (4 jumps), and a branchless variant. These are the results:
// Performance tables: all variants use 64-byte-aligned, cache-line-sized chunks (van Emde Boas layout),
// with the search loop unrolled per cache line and all optimizations turned on. Each element is 4 bytes.
// CPU: Intel i7-4770K (Haswell) @ 3.50 GHz. A sketch of how cycles/query could be measured follows the tables.
Type         Element count    Loop count    Avg. cycles/query
==============================================================
AVX2             210485750     100000000          610
AVX2              21048575     100000000          427
AVX2               2104857     100000000          288
AVX2                210485     100000000          157
AVX2                 21048     100000000           95
AVX2                  2104     100000000           49
AVX2                   210     100000000           17
AVX2                   100     100000000           16

Type         Element count    Loop count    Avg. cycles/query
==============================================================
Branching        210485750     100000000          819
Branching         21048575     100000000          594
Branching          2104857     100000000          358
Branching           210485     100000000          165
Branching            21048     100000000           82
Branching             2104     100000000           49
Branching              210     100000000           21
Branching              100     100000000           16

Type         Element count    Loop count    Avg. cycles/query
==============================================================
Branchless       210485750     100000000          675
Branchless        21048575     100000000          602
Branchless         2104857     100000000          417
Branchless          210485     100000000          273
Branchless           21048     100000000          130
Branchless            2104     100000000           72
Branchless             210     100000000           27
Branchless             100     100000000           18
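For reference, here is a rough sketch of how an avg. cycles/query number like the ones above could be measured with the TSC. This is purely my assumption, since the post doesn't show its harness, and search_tree() is a hypothetical stand-in for whichever of the three variants is being timed:

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc on GCC/Clang; use <intrin.h> on MSVC

extern uint32_t search_tree(uint32_t query);   // hypothetical search under test

uint64_t avg_cycles_per_query(uint32_t const* queries, uint64_t query_count)
{
    volatile uint32_t sink = 0;          // keep the searches from being optimized away
    uint64_t const start = __rdtsc();
    for (uint64_t i = 0; i < query_count; ++i)
        sink += search_tree(queries[i]);
    uint64_t const stop = __rdtsc();
    (void)sink;
    return (stop - start) / query_count;
}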
So my conclusion: when memory access is close to optimal, AVX2 can help with trees bigger than roughly 200k elements. Below that, there is hardly any penalty to pay (as long as you don't use AVX for anything else). It was worth the night of benchmarking. Thanks to everybody involved :-)
I've used SSE2/AVX2 to help perform a B+tree search. Here's AVX2 code that searches a full cache line of 16 DWORDs:
#include <immintrin.h>   // AVX2 intrinsics
#include <stdint.h>

// perf-critical: ensure this is 64-byte aligned (a full cache line).
union alignas(64) bnode
{
    int32_t i32[16];
    __m256i m256[2];
};
// returns from 0 (if value < i32[0]) to 16 (if value >= i32[15])
unsigned bsearch_avx2(bnode const* const node, __m256i const value)
{
    __m256i const perm_mask = _mm256_set_epi32(7, 6, 3, 2, 5, 4, 1, 0);

    // compare the two halves of the cache line.
    __m256i cmp1 = _mm256_load_si256(&node->m256[0]);
    __m256i cmp2 = _mm256_load_si256(&node->m256[1]);
    cmp1 = _mm256_cmpgt_epi32(cmp1, value); // PCMPGTD
    cmp2 = _mm256_cmpgt_epi32(cmp2, value); // PCMPGTD

    // merge the comparisons back together.
    //
    // a permute is required to get the pack results back into order
    // because AVX-256 introduced that unfortunate two-lane interleave.
    //
    // alternately, you could pre-process your data to remove the need
    // for the permute.
    __m256i cmp = _mm256_packs_epi32(cmp1, cmp2);      // PACKSSDW
    cmp = _mm256_permutevar8x32_epi32(cmp, perm_mask); // PERMD

    // finally create a move mask and count trailing
    // zeroes to get an index to the next node.
    unsigned mask = _mm256_movemask_epi8(cmp); // PMOVMSKB
    return _tzcnt_u32(mask) / 2;               // TZCNT
}
You'll end up with a single highly predictable branch per bnode, to test if the end of the tree has been reached.
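For context, here is a hypothetical descent loop built on bsearch_avx2. It is only a sketch under my own assumptions (not from the original answer), using an implicit 17-ary layout where node i's children are stored at i*17 + 1 + slot (16 keys per node means 17 children); the loop condition is the single predictable branch mentioned above.

size_t find_leaf(bnode const* const nodes, size_t const node_count, int32_t const key)
{
    __m256i const needle = _mm256_set1_epi32(key);   // broadcast the query key
    size_t idx = 0;
    while (idx < node_count)                         // the one branch per bnode
    {
        unsigned const slot = bsearch_avx2(&nodes[idx], needle);
        idx = idx * 17 + 1 + slot;                   // descend into the chosen child
    }
    // idx now encodes the leaf position; a real tree would map it back to a payload.
    return idx;
}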
This should be trivially scalable to AVX-512.
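As a rough illustration of that, here is a minimal, untested AVX-512 sketch of my own (not code from this answer): all 16 keys fit in one 512-bit register, and the compare produces a bitmask directly, so the pack/permute step disappears.

unsigned bsearch_avx512(bnode const* const node, int32_t const value)
{
    __m512i const keys   = _mm512_load_si512(node->i32);   // node is 64-byte aligned
    __m512i const needle = _mm512_set1_epi32(value);
    // bit i is set where i32[i] > value
    __mmask16 const gt = _mm512_cmpgt_epi32_mask(keys, needle);
    // index of the first key greater than value; 16 if there is none
    return _tzcnt_u32((unsigned)gt | 0x10000u);
}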
To preprocess and get rid of that slow PERMD instruction, this would be used:
// swap i32[4..7] with i32[8..11] so the PACKSSDW in the search
// already produces results in index order, with no PERMD needed.
void preprocess_avx2(bnode* const node)
{
    __m256i const perm_mask = _mm256_set_epi32(3, 2, 1, 0, 7, 6, 5, 4);
    __m256i *const middle = (__m256i*)&node->i32[4];
    __m256i x = _mm256_loadu_si256(middle);
    x = _mm256_permutevar8x32_epi32(x, perm_mask);
    _mm256_storeu_si256(middle, x);
}
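Once nodes are preprocessed like this, the permute drops out of the search itself. Here is a sketch of that variant under the same assumptions (simply bsearch_avx2 with the PERMD line removed):

unsigned bsearch_avx2_preprocessed(bnode const* const node, __m256i const value)
{
    __m256i cmp1 = _mm256_load_si256(&node->m256[0]);
    __m256i cmp2 = _mm256_load_si256(&node->m256[1]);
    cmp1 = _mm256_cmpgt_epi32(cmp1, value);                     // PCMPGTD
    cmp2 = _mm256_cmpgt_epi32(cmp2, value);                     // PCMPGTD
    // the swapped middle halves make the pack come out in index order already
    __m256i const cmp = _mm256_packs_epi32(cmp1, cmp2);         // PACKSSDW
    unsigned const mask = (unsigned)_mm256_movemask_epi8(cmp);  // PMOVMSKB
    return _tzcnt_u32(mask) / 2;                                // TZCNT
}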