I have an array of bytes where each byte is either 0 or 1. Now I want to pack these values into bits, so that 8 original bytes occupy 1 target byte, with original byte 0 going into bit 0, byte 1 into bit 1, etc. So far I have the following in the kernel:
const uint16_t tid = threadIdx.x;
__shared__ uint8_t packing[cBlockSize];
// ... Computation of the original bytes in packing[tid]
__syncthreads();
if ((tid & 4) == 0)
{
packing[tid] |= packing[tid | 4] << 4;
}
if ((tid & 6) == 0)
{
packing[tid] |= packing[tid | 2] << 2;
}
if ((tid & 7) == 0)
{
pOutput[(tid + blockDim.x*blockIdx.x)>>3] = packing[tid] | (packing[tid | 1] << 1);
}
Is this correct and efficient?
The __ballot()
warp-voting function comes quite handy for this.
Assuming that you can redefine pOutput
to be of uint32_t
type, and that your block size is a multiple of the warp size (32):
unsigned int target = __ballot(packing[tid]);
if (tid % warpSize == 0) {
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
}
Strictly speaking, the if conditional isn't even necessary, as all threads of the warp will write the same data to the same address. So a highly optimized version would just be
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = __ballot(packing[tid]);
For two bits per thread, using uint2 *pOutput
int lane = tid % warpSize;
uint2 target;
target.x = __ballot(__shfl(packing[tid], lane / 2) & (lane & 1) + 1));
target.y = __ballot(__shfl(packing[tid], lane / 2 + warpSize / 2) & (lane & 1) + 1));
pOutput[(tid + blockDim.x*blockIdx.x) / warpSize] = target;
You'll have to benchmark whether this is still faster than your conventional solution.
来源:https://stackoverflow.com/questions/39488441/how-to-pack-bits-efficiently-in-cuda