What is the most efficient algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits
Anders Cedronius's answer provides a great solution for people that have an x86 CPU with AVX2 support. For x86 platforms without AVX support or non-x86 platforms, either of the following implementations should work well.
The first code is a variant of the classic binary partitioning method, coded to maximize the use of the shift-plus-logic idiom useful on various ARM processors. In addition, it uses on-the-fly mask generation which could be beneficial for RISC processors that otherwise require multiple instructions to load each 32-bit mask value. Compilers for x86 platforms should use constant propagation to compute all masks at compile time rather than run time.
/* Classic binary partitioning algorithm */
inline uint32_t brev_classic (uint32_t a)
{
uint32_t m;
a = (a >> 16) | (a << 16); // swap halfwords
m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
m = m^(m << 4); a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
m = m^(m << 2); a = ((a >> 2) & m) | ((a << 2) & ~m);
m = m^(m << 1); a = ((a >> 1) & m) | ((a << 1) & ~m);
return a;
}
In volume 4A of "The Art of Computer Programming", D. Knuth shows clever ways of reversing bits that somewhat surprisingly require fewer operations than the classical binary partitioning algorithms. One such algorithm for 32-bit operands, that I cannot find in TAOCP, is shown in this document on the Hacker's Delight website.
/* Knuth's algorithm from http://www.hackersdelight.org/revisions.pdf. Retrieved 8/19/2015 */
inline uint32_t brev_knuth (uint32_t a)
{
uint32_t t;
a = (a << 15) | (a >> 17);
t = (a ^ (a >> 10)) & 0x003f801f;
a = (t + (t << 10)) ^ a;
t = (a ^ (a >> 4)) & 0x0e038421;
a = (t + (t << 4)) ^ a;
t = (a ^ (a >> 2)) & 0x22488842;
a = (t + (t << 2)) ^ a;
return a;
}
Using the Intel compiler C/C++ compiler 13.1.3.198, both of the above functions auto-vectorize nicely targetting XMM
registers. They could also be vectorized manually without a lot of effort.
On my IvyBridge Xeon E3 1270v2, using the auto-vectorized code, 100 million uint32_t
words were bit-reversed in 0.070 seconds using brev_classic()
, and 0.068 seconds using brev_knuth()
. I took care to ensure that my benchmark was not limited by system memory bandwidth.