When you and
some data with a mask you get some result which is of the same size as the data/mask.
What I want to do, is to take the masked bits in the result (where there was 1 in the mask) and shift them to the right so they are next to each other and I can perform a CTZ (Count Trailing Zeroes) on them.
I didn't know how to name such a procedure so Google has failed me. The operation should preferably not be a loop solution, this has to be as fast operation as possible.
And here is an incredible image made in MS Paint.
This operation is known as compress right. It is implemented as part of BMI2 as the PEXT
instruction, in Intel processors as of Haswell.
Unfortunately, without hardware support is it a quite annoying operation. Of course there is an obvious solution, just moving the bits one by one in a loop, here is the one given by Hackers Delight:
unsigned compress(unsigned x, unsigned m) {
unsigned r, s, b; // Result, shift, mask bit.
r = 0;
s = 0;
do {
b = m & 1;
r = r | ((x & b) << s);
s = s + b;
x = x >> 1;
m = m >> 1;
} while (m != 0);
return r;
}
But there is an other way, also given by Hackers Delight, which does less looping (number of iteration logarithmic in the number of bits) but more per iteration:
unsigned compress(unsigned x, unsigned m) {
unsigned mk, mp, mv, t;
int i;
x = x & m; // Clear irrelevant bits.
mk = ~m << 1; // We will count 0's to right.
for (i = 0; i < 5; i++) {
mp = mk ^ (mk << 1); // Parallel prefix.
mp = mp ^ (mp << 2);
mp = mp ^ (mp << 4);
mp = mp ^ (mp << 8);
mp = mp ^ (mp << 16);
mv = mp & m; // Bits to move.
m = m ^ mv | (mv >> (1 << i)); // Compress m.
t = x & mv;
x = x ^ t | (t >> (1 << i)); // Compress x.
mk = mk & ~mp;
}
return x;
}
Notice that a lot of the values there depend only on m
. Since you only have 512 different masks, you could precompute those and simplify the code to something like this (not tested)
unsigned compress(unsigned x, int maskindex) {
unsigned t;
int i;
x = x & masks[maskindex][0];
for (i = 0; i < 5; i++) {
t = x & masks[maskindex][i + 1];
x = x ^ t | (t >> (1 << i));
}
return x;
}
Of course all of these can be turned into "not a loop" by unrolling, the second and third ways are probably more suitable for that. That's a bit of cheat however.
You can use the pack-by-multiplication technique similar to the one described here. This way you don't need any loop and can mix the bits in any order.
For example with the mask 0b10101001 == 0xA9
like above and 8-bit data abcdefgh
(with a-h is the 8 bits) you can use the below expression to get 0000aceh
uint8_t compress_maskA9(uint8_t x)
{
const uint8_t mask1 = 0xA9 & 0xF0;
const uint8_t mask2 = 0xA9 & 0x0F;
return (((x & mask1)*0x03000000 >> 28) & 0x0C) | ((x & mask2)*0x50000000 >> 30);
}
In this specific case there are some overlaps of the 4 bits while adding (which incur unexpected carry) during the multiplication step, so I've split them into 2 parts, the first one extracts bit a and c, then e and h will be extracted in the latter part. There are other ways to split the bits as well, like a & h then c & e. You can see the results compared to Harold's function live on ideone
An alternate way with only one multiplication
const uint32_t X = (x << 8) | x;
return (X & 0x8821)*0x12050000 >> 28;
I got this by duplicating the bits so that they're spaced out farther, leaving enough space to avoid the carry. This is often better than splitting into 2 multiplications
If you want the result's bits reversed (i.e. heca0000
) you can easily change the magic numbers accordingly
// result: he00 | 00ca;
return (((x & 0x09)*0x88000000 >> 28) & 0x0C) | (((x & 0xA0)*0x04800000) >> 30);
or you can also extract the 3 bits e, c and a at the same time, leaving h separately (as I mentioned above, there are often multiple solutions) and you need only one multiplication
return ((x & 0xA8)*0x12400000 >> 29) | (x & 0x01) << 3; // result: 0eca | h000
But there might be a better alternative like the above second snippet
const uint32_t X = (x << 8) | x;
return (X & 0x2881)*0x80290000 >> 28
Correctness check: http://ideone.com/PYUkty
For a larger number of masks you can precompute the magic numbers correspond to those masks and store them in an array so that you can look them up immediately for use. I calculated those mask by hand but you can do that automatically
Explanation
We have abcdefgh & mask1 = a0c00000
. Multiply it with magic1
........................a0c00000
× 00000011000000000000000000000000 (magic1 = 0x03000000)
────────────────────────────────
a0c00000........................
+ a0c00000......................... (the leading "a" bit is outside int's range
──────────────────────────────── so it'll be truncated)
r1 = acc.............................
=> (r1 >> 28) & 0x0C = 0000ac00
Similarly we multiply abcdefgh & mask2 = 0000e00h
with magic2
........................0000e00h
× 01010000000000000000000000000000 (magic2 = 0x50000000)
────────────────────────────────
e00h............................
+ 0h..............................
────────────────────────────────
r2 = eh..............................
=> (r2 >> 30) = 000000eh
Combine them together we have the expected result
((r1 >> 28) & 0x0C) | (r2 >> 30) = 0000aceh
And here's the demo for the second snippet
abcdefghabcdefgh
& 1000100000100001 (0x8821)
────────────────────────────────
a000e00000c0000h
× 00010010000001010000000000000000 (0x12050000)
────────────────────────────────
000h
00e00000c0000h
+ 0c0000h
a000e00000c0000h
────────────────────────────────
= acehe0h0c0c00h0h
& 11110000000000000000000000000000
────────────────────────────────
= aceh
For the reversed order case:
abcdefghabcdefgh
& 0010100010000001 (0x2881)
────────────────────────────────
00c0e000a000000h
x 10000000001010010000000000000000 (0x80290000)
────────────────────────────────
000a000000h
00c0e000a000000h
+ 0e000a000000h
h
────────────────────────────────
hecaea00a0h0h00h
& 11110000000000000000000000000000
────────────────────────────────
= heca
Related:
来源:https://stackoverflow.com/questions/28282869/shift-masked-bits-to-the-lsb