What is the most efficient algorithm to achieve the following:
0010 0000 => 0000 0100
The conversion is from MSB->LSB to LSB->MSB. All bits
I thought this is one of the simplest way to reverse the bit. please let me know if there is any flaw in this logic. basically in this logic, we check the value of the bit in position. set the bit if value is 1 on reversed position.
void bit_reverse(ui32 *data)
{
ui32 temp = 0;
ui32 i, bit_len;
{
for(i = 0, bit_len = 31; i <= bit_len; i++)
{
temp |= (*data & 1 << i)? (1 << bit_len-i) : 0;
}
*data = temp;
}
return;
}
Native ARM instruction "rbit" can do it with 1 cpu cycle and 1 extra cpu register, impossible to beat.
I know it isn't C but asm:
var1 dw 0f0f0
clc
push ax
push cx
mov cx 16
loop1:
shl var1
shr ax
loop loop1
pop ax
pop cx
This works with the carry bit, so you may save flags too
I think the simplest method I know follows. MSB
is input and LSB
is 'reversed' output:
unsigned char rev(char MSB) {
unsigned char LSB=0; // for output
_FOR(i,0,8) {
LSB= LSB << 1;
if(MSB&1) LSB = LSB | 1;
MSB= MSB >> 1;
}
return LSB;
}
// It works by rotating bytes in opposite directions.
// Just repeat for each byte.
It seems that many other posts are concerned about speed (i.e best = fastest). What about simplicity? Consider:
char ReverseBits(char character) {
char reversed_character = 0;
for (int i = 0; i < 8; i++) {
char ith_bit = (c >> i) & 1;
reversed_character |= (ith_bit << (sizeof(char) - 1 - i));
}
return reversed_character;
}
and hope that clever compiler will optimise for you.
If you want to reverse a longer list of bits (containing sizeof(char) * n
bits), you can use this function to get:
void ReverseNumber(char* number, int bit_count_in_number) {
int bytes_occupied = bit_count_in_number / sizeof(char);
// first reverse bytes
for (int i = 0; i <= (bytes_occupied / 2); i++) {
swap(long_number[i], long_number[n - i]);
}
// then reverse bits of each individual byte
for (int i = 0; i < bytes_occupied; i++) {
long_number[i] = ReverseBits(long_number[i]);
}
}
This would reverse [10000000, 10101010] into [01010101, 00000001].
Anders Cedronius's answer provides a great solution for people that have an x86 CPU with AVX2 support. For x86 platforms without AVX support or non-x86 platforms, either of the following implementations should work well.
The first code is a variant of the classic binary partitioning method, coded to maximize the use of the shift-plus-logic idiom useful on various ARM processors. In addition, it uses on-the-fly mask generation which could be beneficial for RISC processors that otherwise require multiple instructions to load each 32-bit mask value. Compilers for x86 platforms should use constant propagation to compute all masks at compile time rather than run time.
/* Classic binary partitioning algorithm */
inline uint32_t brev_classic (uint32_t a)
{
uint32_t m;
a = (a >> 16) | (a << 16); // swap halfwords
m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
m = m^(m << 4); a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
m = m^(m << 2); a = ((a >> 2) & m) | ((a << 2) & ~m);
m = m^(m << 1); a = ((a >> 1) & m) | ((a << 1) & ~m);
return a;
}
In volume 4A of "The Art of Computer Programming", D. Knuth shows clever ways of reversing bits that somewhat surprisingly require fewer operations than the classical binary partitioning algorithms. One such algorithm for 32-bit operands, that I cannot find in TAOCP, is shown in this document on the Hacker's Delight website.
/* Knuth's algorithm from http://www.hackersdelight.org/revisions.pdf. Retrieved 8/19/2015 */
inline uint32_t brev_knuth (uint32_t a)
{
uint32_t t;
a = (a << 15) | (a >> 17);
t = (a ^ (a >> 10)) & 0x003f801f;
a = (t + (t << 10)) ^ a;
t = (a ^ (a >> 4)) & 0x0e038421;
a = (t + (t << 4)) ^ a;
t = (a ^ (a >> 2)) & 0x22488842;
a = (t + (t << 2)) ^ a;
return a;
}
Using the Intel compiler C/C++ compiler 13.1.3.198, both of the above functions auto-vectorize nicely targetting XMM
registers. They could also be vectorized manually without a lot of effort.
On my IvyBridge Xeon E3 1270v2, using the auto-vectorized code, 100 million uint32_t
words were bit-reversed in 0.070 seconds using brev_classic()
, and 0.068 seconds using brev_knuth()
. I took care to ensure that my benchmark was not limited by system memory bandwidth.