Question
I'm interested in a fast method for "expanding bits," which can be defined as the following:
- Let B be a binary number with n bits, i.e. B \in {0,1}^n.
- Let P be the positions of all 1/true bits in B, i.e. `((1 << p[i]) & B) != 0` for every p[i] in P, and |P| = k.
- For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that bit j of A contributes `A[j] << p[j]` to Ap and all other bits of Ap are 0.
- The result of the "bit expansion" is Ap.
A couple of examples:
- Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
- Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
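To make the definition concrete, here is a minimal C sketch that follows it literally: it first collects the positions p[j] of the set bits of B, then ORs each bit A[j] into position p[j]. The name `expand_by_definition` and the fixed 32-bit width are assumptions for illustration, not part of the question.

```c
#include <stdint.h>

/* Hypothetical helper: literal transcription of the definition, 32-bit only. */
static uint32_t expand_by_definition(uint32_t A, uint32_t B)
{
    int p[32];          /* p[j] = position of the j-th set bit of B, LSB first */
    int k = 0;          /* number of set bits found so far; equals |P| at the end */

    for (int i = 0; i < 32; ++i) {
        if ((B >> i) & 1u) {
            p[k++] = i;
        }
    }

    uint32_t Ap = 0;
    for (int j = 0; j < k; ++j) {
        uint32_t bit = (A >> j) & 1u;   /* A[j] */
        Ap |= bit << p[j];              /* place it at position p[j] */
    }
    return Ap;
}
```

With the two examples above written in hexadecimal, B = 0x2E, A = 0x6 gives 0x0C, and B = 0x99, A = 0xD gives 0x91.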
Following is a straightforward algorithm, but I can't shake the feeling that there's a faster/easier way to do this.
    unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
        int k = popcount(B); // cuda function, but there are good methods for this
        unsigned int Ap = 0;
        int j = k - 1;
        // Starting at the most significant bit,
        for (int i = n - 1; i >= 0; --i) {
            Ap <<= 1;
            // if bit i of B is 1, add the value at A[j] to Ap, decrement j.
            // (1u keeps the shift unsigned, so i == 31 is well defined.)
            if (B & (1u << i)) {
                Ap += (A >> j--) & 1;
            }
        }
        return Ap;
    }
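`popcount` is not defined in the snippet above. In CUDA device code the corresponding intrinsic is `__popc`; for host-side testing, a portable fallback can be sketched as below. The name `popcount32` is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical portable fallback: count set bits by clearing the lowest
   set bit once per iteration (one loop trip per 1-bit). */
static int popcount32(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;     /* clear the least significant 1-bit */
        ++count;
    }
    return count;
}
```

Both masks from the examples, 0x2E and 0x99, have four set bits, which matches the 4-bit values of A used there.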
Answer 1:
The question appears to be asking for a CUDA emulation of the BMI2 instruction `PDEP`, which takes a source operand `a` and deposits its bits based on the positions of the 1-bits of a mask `b`. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
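For reference on the CPU side, x86 compilers expose PDEP through the `_pdep_u32` intrinsic in `<immintrin.h>`. The sketch below, with the hypothetical wrapper name `pdep32_host`, uses the intrinsic when the compiler defines `__BMI2__` (for example with `-mbmi2` on GCC or Clang) and otherwise falls back to a plain bit-by-bit loop. It is meant as a host-side oracle for checking an emulation, not as GPU code.

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>
#endif

/* Hypothetical wrapper: hardware PDEP where available, simple loop otherwise. */
static uint32_t pdep32_host(uint32_t a, uint32_t mask)
{
#if defined(__BMI2__)
    return _pdep_u32(a, mask);
#else
    uint32_t r = 0;
    int j = 0;                              /* index of the next source bit of a */
    for (int i = 0; i < 32; ++i) {
        if ((mask >> i) & 1u) {
            r |= ((a >> j) & 1u) << i;      /* deposit bit j of a at position i */
            ++j;
        }
    }
    return r;
#endif
}
```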
I am assuming, based on the two examples given, that the mask `b` in general is sparse, and that we can minimize work by only iterating over the 1-bits of `b`. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask `b` has a stronger positive influence on performance than the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression `1U << i` can be replaced by addition: introduce a variable `m` that is initialized to `1` before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask `b` in turn (starting at the least significant end), AND it with the value of the i-th bit of `a`, and incorporate the result into the expanded destination. After a 1-bit from `b` has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of `a` into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
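The two bit tricks described above can be tried in isolation. The sketch below uses the hypothetical helper names `lowest_set_bit` and `spread_from_bit`; both rely on unsigned arithmetic wrapping modulo 2^32, which C guarantees for unsigned types.

```c
#include <stdint.h>

/* Isolate the least significant 1-bit of x (0 if x == 0). */
static uint32_t lowest_set_bit(uint32_t x)
{
    return x & (0u - x);
}

/* Given a value with at most one bit set, return a mask with that bit and
   all more significant bits set (0 if the input is 0). */
static uint32_t spread_from_bit(uint32_t single_bit)
{
    return 0u - single_bit;
}
```

The i-th 1-bit of the mask (counting from 0) always sits at position i or higher, so ANDing the isolated mask bit with the spread value yields the mask bit itself when bit i of `a` is set and 0 otherwise. For example, `lowest_set_bit(0x2E)` is 0x02 and `spread_from_bit(0x4)` is 0xFFFFFFFC.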
    /* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
       bit) at the positions indicated by the set bits of the mask stored in 'b'.
    */
    __device__ unsigned int my_pdep (unsigned int a, unsigned int b)
    {
        unsigned int l, s, r = 0;
        int i;
        for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
            l = b & (0 - b); // extract mask's least significant 1-bit
            b = b ^ l; // clear mask's least significant 1-bit
            s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
            r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        }
        return r;
    }
The variant without any shift operations alluded to above looks as follows:
    /* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
       bit) at the positions indicated by the set bits of the mask stored in 'b'.
    */
    __device__ unsigned int my_pdep (unsigned int a, unsigned int b)
    {
        unsigned int l, s, r = 0, m = 1;
        while (b) { // iterate over 1-bits in mask, until mask becomes 0
            l = b & (0 - b); // extract mask's least significant 1-bit
            b = b ^ l; // clear mask's least significant 1-bit
            s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
            r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
            m = m + m; // mask for next bit of 'a'
        }
        return r;
    }
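Since the loop uses nothing GPU-specific, it can be checked on the host by dropping the `__device__` qualifier. The following is a minimal test sketch under that assumption; the names `my_pdep_host`, `pdep_reference`, `next_random`, and `count_mismatches` are hypothetical, and the xorshift generator is only there to produce repeatable inputs.

```c
#include <stdint.h>

/* Host-side copy of the shift-free loop above, with __device__ dropped. */
static uint32_t my_pdep_host(uint32_t a, uint32_t b)
{
    uint32_t l, s, r = 0, m = 1;
    while (b) {
        l = b & (0u - b);   /* extract mask's least significant 1-bit */
        b = b ^ l;          /* clear it */
        s = 0u - (a & m);   /* spread current bit of 'a' upwards */
        r = r | (l & s);    /* deposit it at the mask bit's position */
        m = m + m;          /* select next bit of 'a' */
    }
    return r;
}

/* Bit-by-bit reference that follows the definition of the operation. */
static uint32_t pdep_reference(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    int j = 0;
    for (int i = 0; i < 32; ++i) {
        if ((b >> i) & 1u) {
            r |= ((a >> j) & 1u) << i;
            ++j;
        }
    }
    return r;
}

/* xorshift32 step, used only to generate repeatable test inputs. */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Count inputs on which the two functions disagree; 0 is the expected result. */
static int count_mismatches(int trials)
{
    uint32_t state = 0x12345678u;
    int mismatches = 0;
    for (int t = 0; t < trials; ++t) {
        uint32_t a = next_random(&state);
        uint32_t b = next_random(&state);
        b &= next_random(&state);           /* thin the mask out a little */
        if (my_pdep_host(a, b) != pdep_reference(a, b)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A check like this only establishes agreement with the reference on the sampled inputs; it says nothing about relative speed on a GPU, which would have to be measured in the actual kernel.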
In comments below, @Evgeny Kluev pointed to a shift-free `PDEP` emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.
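The referenced code is not reproduced in this answer. As an assumption about the general style of such a loop, and not a quotation of that page, a shift-free variant that walks a doubling bit selector over the source and tests each bit with a branch could be sketched as follows, adapted to 32 bits under the hypothetical name `pdep_branchy`:

```c
#include <stdint.h>

/* Hypothetical sketch: shift-free PDEP emulation using a conditional
   instead of the negate-and-AND trick. */
static uint32_t pdep_branchy(uint32_t val, uint32_t mask)
{
    uint32_t res = 0;
    for (uint32_t bb = 1; mask != 0; bb += bb) {   /* bb selects next bit of val */
        if (val & bb) {
            res |= mask & (0u - mask);   /* deposit at mask's lowest 1-bit */
        }
        mask &= mask - 1;                /* clear mask's lowest 1-bit */
    }
    return res;
}
```

Compared with the two versions above, this trades the negation and AND for a conditional, which a GPU compiler may turn into predicated instructions; whether it is actually faster depends on the hardware and on the masks involved.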
Source: https://stackoverflow.com/questions/35879269/bit-twiddle-help-expanding-bits-to-follow-a-given-bitmask