I am trying to write very efficient Hamming-distance code. Inspired by Wojciech Muła's extremely clever SSE3 popcount implementation, I coded an AVX2 equivale
You should consider using the usual _mm_popcnt_u64 instruction instead of hacking it in SSE or AVX. I tested all methods for popcounting thoroughly, including an SSE and an AVX version (which ultimately led to my more or less famous question about popcount). _mm_popcnt_u64 outperforms SSE and AVX considerably, especially when you use a compiler that avoids the Intel popcount bug discovered in my question. Without the bug, my Haswell is able to popcount 26 GB/s, which almost hits the bus bandwidth.
The reason why _mm_popcnt_u64 is faster is simply that it popcounts 64 bits at once (so already 1/4 of the width of the AVX version) while requiring only one cheap processor instruction. It costs only a few cycles (latency 3, throughput 1 on Intel). Even if every AVX instruction you use required only one cycle, you would still get worse results due to the sheer number of instructions necessary for popcounting 256 bits.
Try this, it should be fastest:
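#include <stdint.h>
#include <nmmintrin.h>  /* _mm_popcnt_u64 */

int popcount256(const uint64_t* u){
    /* Sum the popcounts of the four 64-bit words in one expression. */
    return (int)(_mm_popcnt_u64(u[0])
               + _mm_popcnt_u64(u[1])
               + _mm_popcnt_u64(u[2])
               + _mm_popcnt_u64(u[3]));
}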
I know this does not answer your core question of why AVX is slower, but since your ultimate goal is fast popcount, the AVX <-> SSE comparison is irrelevant, as both are inferior to the built-in popcount.
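Since the end goal is Hamming distance rather than popcount alone, here is a minimal sketch of how the same idea might be applied: XOR the two inputs word by word and popcount the result. The function name hamming_distance is hypothetical, and the sketch assumes GCC or Clang, using the __builtin_popcountll builtin so that it compiles without extra flags. With -mpopcnt (or -march=native on a CPU that supports it) the compiler should emit the hardware popcnt instruction, and you could substitute _mm_popcnt_u64 directly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: Hamming distance between two bit arrays,
   each consisting of n 64-bit words. Assumes GCC or Clang for
   __builtin_popcountll. */
static uint64_t hamming_distance(const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Bits that differ between a[i] and b[i] are set in the XOR. */
        total += (uint64_t)__builtin_popcountll(a[i] ^ b[i]);
    }
    return total;
}
```

Whether this matches the 26 GB/s figure above will depend on your compiler and CPU, so it is worth benchmarking on your own data.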