There's no way to answer this in general; it all depends on the compiler
and the underlying architecture. The only real way to know is to try
different solutions, and measure. (On some machines, for example,
shifts can be very expensive. On others, no.) For starters, I'd use
something like:
uint64_t mask = 1;
int index = 0;
while ( mask != 0 ) {
    if ( (bits & mask) != 0 ) {
        ++ bit_counter[index];
    }
    ++ index;
    mask <<= 1;
}
Unrolling the loop completely will likely improve performance.
Depending on the architecture, replacing the if
with:
bit_counter[index] += ((bits & mask) != 0);
might be better. Or worse... it's impossible to know in advance. It's
also possible that on some machines, systematically shifting into the
low order bit and masking, as you are doing, would be best.
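That last variant might look something like this (a sketch only; the function name and the 64-entry `bit_counter` array are assumptions based on the loop above):

```cpp
#include <cstdint>

// Shift the word right and test the low-order bit, rather than
// shifting a mask left. On some machines this is cheaper, on
// others it isn't -- only measurement will tell.
void countBitsLowOrder( uint64_t bits, int bit_counter[64] )
{
    for ( int index = 0; index < 64; ++ index ) {
        bit_counter[index] += bits & 1;  // low bit is 0 or 1
        bits >>= 1;                      // bring the next bit down
    }
}
```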
Some optimizations will also depend on what typical data looks like. If
most of the words only have one or two bits set, you might gain by
testing a byte at a time, or four bits at a time, and skipping
entirely those that are all zeros.
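A byte-at-a-time version could be sketched as follows (again hypothetical; whether the early skip pays off depends entirely on how sparse the data really is):

```cpp
#include <cstdint>

// Test one byte at a time, skipping bytes that are all zeros.
// Wins only if most bytes of a typical word are zero.
void countBitsSkipZeroBytes( uint64_t bits, int bit_counter[64] )
{
    for ( int byte = 0; byte < 8; ++ byte ) {
        uint64_t b = (bits >> (byte * 8)) & 0xFF;
        if ( b == 0 ) {
            continue;   // nothing set here: skip eight bit tests
        }
        for ( int i = 0; i < 8; ++ i ) {
            bit_counter[byte * 8 + i] += (b >> i) & 1;
        }
    }
}
```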