Bit popcount for large buffer, with Core 2 CPU (SSSE3)

ε祈祈猫儿з submitted on 2019-11-27 14:06:29
Ira Baxter

See the 32-bit version in the AMD Software Optimization Guide, page 195, for one implementation. It gives you x86 assembly code directly.

See a variant on the Stanford bit-twiddling hacks page. The Stanford version looks like the best one to me, and it looks very easy to code as x86 asm.

Neither of these uses branch instructions.
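As a minimal sketch of the branch-free approach, here is the well-known parallel bit-summing popcount for one 32-bit word, written in C rather than x86 asm. The function name `popcount32` is only an illustrative choice.

```c
#include <stdint.h>

/* Branch-free 32-bit population count using parallel bit sums.
   The name popcount32 is a hypothetical helper, not from the guides above. */
static uint32_t popcount32(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 /* sums of bit pairs   */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); /* sums per nibble     */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 /* sums per byte       */
    return (v * 0x01010101u) >> 24;                   /* add the four bytes  */
}
```

Every step is a shift, mask, add, or multiply, so the compiled code contains no conditional jumps.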

These can be generalized to 64 bit versions.
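A 64-bit generalization can be sketched by widening the masks and changing the final shift. This assumes a compiler with `uint64_t` support; the names `popcount64` and `popcount_words` are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free 64-bit population count: same steps as the 32-bit version,
   with 64-bit masks. popcount64 is a hypothetical helper name. */
static uint64_t popcount64(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (v * 0x0101010101010101ULL) >> 56;  /* add the eight bytes */
}

/* Sum the popcounts of n 64-bit words (hypothetical helper for buffers). */
static uint64_t popcount_words(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += popcount64(words[i]);
    return total;
}
```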

With the 32- or 64-bit versions, you might consider doing a SIMD version. SSE2 will do 4 double-words or 2 quadwords (either way 128 bits) at once. What you want to do is implement the popcount for 32 or 64 bits in each of the 4 or 2 lanes of an XMM register. You'll end up with 4 or 2 popcounts in the XMM register when you are done; the final step is to store those popcounts and add them together to get the final answer. As a guess, I'd expect you to do slightly better with 4 parallel 32-bit popcounts than with 2 parallel 64-bit popcounts, as the latter is likely to take 1 or 2 additional instructions in each iteration, and it's easy to add four 32-bit values together at the end.
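The 4 x 32-bit idea might look roughly like the following sketch using SSE2 intrinsics in C. It is one possible arrangement, not a tuned implementation. The name `popcount_sse2` is hypothetical, unaligned loads are used so any pointer works, and the 32-bit lane accumulators are assumed not to overflow (buffers well under 2 GiB).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Count set bits in buf using four parallel 32-bit popcounts per 16 bytes.
   Hypothetical sketch; assumes len is small enough that no 32-bit lane
   of the accumulator overflows. */
static uint64_t popcount_sse2(const uint8_t *buf, size_t len)
{
    const __m128i m1 = _mm_set1_epi32(0x55555555);
    const __m128i m2 = _mm_set1_epi32(0x33333333);
    const __m128i m4 = _mm_set1_epi32(0x0F0F0F0F);
    const __m128i m6 = _mm_set1_epi32(0x0000003F);
    __m128i acc = _mm_setzero_si128();   /* four running 32-bit counts */
    size_t i = 0;

    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        v = _mm_sub_epi32(v, _mm_and_si128(_mm_srli_epi32(v, 1), m1));
        v = _mm_add_epi32(_mm_and_si128(v, m2),
                          _mm_and_si128(_mm_srli_epi32(v, 2), m2));
        v = _mm_and_si128(_mm_add_epi32(v, _mm_srli_epi32(v, 4)), m4);
        /* SSE2 has no 32-bit low multiply, so fold the bytes with shifts. */
        v = _mm_add_epi32(v, _mm_srli_epi32(v, 8));
        v = _mm_add_epi32(v, _mm_srli_epi32(v, 16));
        acc = _mm_add_epi32(acc, _mm_and_si128(v, m6));
    }

    /* Final step: store the four lane counts and add them together. */
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    uint64_t total = (uint64_t)lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Trailing bytes (fewer than 16) are counted one bit at a time. */
    for (; i < len; i++) {
        uint8_t b = buf[i];
        while (b) { b &= (uint8_t)(b - 1); total++; }
    }
    return total;
}
```

The horizontal add happens once, after the loop, which is why the 32-bit lane layout keeps the per-iteration work small.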

I outline the best C/assembly functions I found for population count/Hamming weight of large buffers below.

The fastest assembly function is ssse3_popcount3, described here. It requires SSSE3, which is available on Intel Core 2 and later and on AMD chips arriving in 2011. It uses SIMD instructions to popcount in 16-byte chunks and unrolls 4 loop iterations at a time.
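For orientation, a common SSSE3 technique is a nibble lookup with `pshufb` (`_mm_shuffle_epi8`). The sketch below illustrates that technique with C intrinsics; it is not the ssse3_popcount3 source, it does not unroll the loop, and the name `popcount_ssse3_sketch` is hypothetical. The `target` attribute is a GCC/clang assumption that lets it compile without `-mssse3`.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */
#include <stddef.h>
#include <stdint.h>

/* Popcount a buffer with a 16-entry nibble lookup table held in a register.
   Hypothetical sketch of the pshufb technique, not ssse3_popcount3 itself. */
#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
static uint64_t popcount_ssse3_sketch(const uint8_t *buf, size_t len)
{
    /* lut[n] = number of set bits in the 4-bit value n */
    const __m128i lut  = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                       1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low4 = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();
    __m128i total = zero;               /* two running 64-bit sums */
    size_t i = 0;

    while (i + 16 <= len) {
        __m128i acc8 = zero;            /* per-byte counts for this batch */
        /* Each chunk adds at most 8 per byte; 31 chunks give at most 248,
           so the 8-bit counters cannot overflow before the flush. */
        for (int k = 0; k < 31 && i + 16 <= len; k++, i += 16) {
            __m128i v  = _mm_loadu_si128((const __m128i *)(buf + i));
            __m128i lo = _mm_and_si128(v, low4);
            __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low4);
            acc8 = _mm_add_epi8(acc8, _mm_shuffle_epi8(lut, lo));
            acc8 = _mm_add_epi8(acc8, _mm_shuffle_epi8(lut, hi));
        }
        /* psadbw against zero sums each group of 8 bytes into a 64-bit lane. */
        total = _mm_add_epi64(total, _mm_sad_epu8(acc8, zero));
    }

    uint64_t parts[2];
    _mm_storeu_si128((__m128i *)parts, total);
    uint64_t count = parts[0] + parts[1];

    /* Trailing bytes (fewer than 16) are counted one bit at a time. */
    for (; i < len; i++) {
        uint8_t b = buf[i];
        while (b) { b &= (uint8_t)(b - 1); count++; }
    }
    return count;
}
```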

The fastest C function is popcount_24words, described here. It uses the bit-slicing algorithm. Of note, I found that clang could actually generate appropriate vector assembly instructions, which gave impressive performance increases. Even without that, the algorithm is still extremely fast.
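The popcount_24words code itself is not reproduced here. To give a flavor of how several words can share counting work, the sketch below uses a related and simpler trick, a carry-save adder step that merges three words into two before counting. This is a swapped-in illustration, not the bit-slicing routine named above, and the names `popcount64_swar` and `popcount_csa3` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain branch-free 64-bit popcount used as the building block. */
static uint64_t popcount64_swar(uint64_t v)
{
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (v * 0x0101010101010101ULL) >> 56;
}

/* Count bits in n words, three at a time, with one carry-save adder step:
   pop(a) + pop(b) + pop(c) == 2 * pop(carry) + pop(sum). */
static uint64_t popcount_csa3(const uint64_t *w, size_t n)
{
    uint64_t total = 0;
    size_t i = 0;

    for (; i + 3 <= n; i += 3) {
        uint64_t a = w[i], b = w[i + 1], c = w[i + 2];
        uint64_t u     = a ^ b;
        uint64_t sum   = u ^ c;              /* positions with an odd count   */
        uint64_t carry = (a & b) | (u & c);  /* positions with at least two   */
        total += 2 * popcount64_swar(carry) + popcount64_swar(sum);
    }
    for (; i < n; i++)                       /* leftover one or two words     */
        total += popcount64_swar(w[i]);
    return total;
}
```

Three input words cost two popcounts plus a handful of logic operations, and loops of this shape are the kind a vectorizing compiler can sometimes turn into SIMD code.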

I would suggest implementing one of the optimised 32 bit popcnt routines from Hacker's Delight, but do it for 4 x 32 bit integer elements in an SSE vector. You can then process 128 bits per iteration, which should give you around 4x throughput compared to an optimised 32 bit scalar routine.
