I had the idea for a warp-based parallel reduction, since all threads of a warp are in sync by definition.
So the idea was that the input data can be reduced by fact
I think the reason your code is slower than mine is that in my code, half as many warps are active for each ADD in the first phase. In your code, all warps are active for all of the first phase. So overall your code executes more warp instructions. In CUDA it's important to consider total "warp instructions" executed, not just the number of instructions executed by one warp.
Also, there's no point in only using half of your warps. There is overhead in launching the warps only to have them evaluate two branches and exit.
Another thought is that the use of unsigned char and short might actually be costing you performance. I'm not sure, but it's certainly not saving you registers, since they are not packed into single 32-bit variables.
Also, in my original code, I replaced blockDim.x with a template parameter, BLOCKDIM, which means that it only used 5 run-time if statements (the ifs in the second stage are eliminated by the compiler).
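
To make the template idea concrete, here is a minimal sketch. It is not the original code: the kernel name blockSum, the host harness, and the example sizes are all made up for illustration. It assumes the kernel is launched with exactly BLOCKDIM threads per block and that BLOCKDIM is a power of two no larger than 1024.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sums BLOCKDIM consecutive floats.
// Assumes a launch with exactly BLOCKDIM threads per block, where
// BLOCKDIM is a power of two no larger than 1024.
template <unsigned int BLOCKDIM>
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float sdata[BLOCKDIM];
    const unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * BLOCKDIM + tid];
    __syncthreads();

    // Each outer test compares compile-time constants, so the compiler can
    // drop it entirely; only the inner tests on tid remain at run time.
    if (BLOCKDIM >= 1024) { if (tid < 512) sdata[tid] += sdata[tid + 512]; __syncthreads(); }
    if (BLOCKDIM >=  512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (BLOCKDIM >=  256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (BLOCKDIM >=  128) { if (tid <  64) sdata[tid] += sdata[tid +  64]; __syncthreads(); }
    if (BLOCKDIM >=   64) { if (tid <  32) sdata[tid] += sdata[tid +  32]; __syncthreads(); }
    if (BLOCKDIM >=   32) { if (tid <  16) sdata[tid] += sdata[tid +  16]; __syncthreads(); }
    if (BLOCKDIM >=   16) { if (tid <   8) sdata[tid] += sdata[tid +   8]; __syncthreads(); }
    if (BLOCKDIM >=    8) { if (tid <   4) sdata[tid] += sdata[tid +   4]; __syncthreads(); }
    if (BLOCKDIM >=    4) { if (tid <   2) sdata[tid] += sdata[tid +   2]; __syncthreads(); }
    if (BLOCKDIM >=    2) { if (tid <   1) sdata[tid] += sdata[tid +   1]; __syncthreads(); }

    if (tid == 0) out[blockIdx.x] = sdata[0];
}

int main()
{
    constexpr unsigned int kBlockDim = 256;  // example block size (assumption)
    constexpr unsigned int kBlocks   = 4;    // example grid size (assumption)
    std::vector<float> h_in(kBlockDim * kBlocks, 1.0f);
    std::vector<float> h_out(kBlocks, 0.0f);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, h_out.size() * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float), cudaMemcpyHostToDevice);

    // The template argument must match the launch block size.
    blockSum<kBlockDim><<<kBlocks, kBlockDim>>>(d_in, d_out);
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(float), cudaMemcpyDeviceToHost);

    for (unsigned int b = 0; b < kBlocks; ++b)
        printf("block %u sum = %f\n", b, h_out[b]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The sketch keeps a __syncthreads() after every step for simplicity, so it does not rely on any warp-level synchronization. The price of the template is that each block size you want to support needs its own instantiation.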
BTW, a cheaper way to compute your threadWarpId is:
const int threadWarpId = threadIdx.x & 31;
You might check this article for more ideas.
EDIT: Here's an alternative warp-based block reduction.
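template <typename T, int level>
__device__
void sumReduceWarp(volatile T *sdata, const unsigned int tid)
{
  // Warp-synchronous tree reduction: after the last step, the first
  // thread's sdata[tid] holds the sum of 2^level elements. Steps whose
  // test on level is false are removed at compile time.
  T t = sdata[tid];
  if (level > 5) sdata[tid] = t = t + sdata[tid + 32];
  if (level > 4) sdata[tid] = t = t + sdata[tid + 16];
  if (level > 3) sdata[tid] = t = t + sdata[tid +  8];
  if (level > 2) sdata[tid] = t = t + sdata[tid +  4];
  if (level > 1) sdata[tid] = t = t + sdata[tid +  2];
  if (level > 0) sdata[tid] = t = t + sdata[tid +  1];
}

template <typename T>
__device__
void sumReduceBlock(T *output, volatile T *sdata)
{
  // sdata is a shared array of length 2 * blockDim.x
  const unsigned int warp = threadIdx.x >> 5;
  const unsigned int lane = threadIdx.x & 31;
  const unsigned int tid  = (warp << 6) + lane;  // each warp owns 64 elements

  // Template arguments are reconstructed (assumption): level 6 folds
  // the 64 elements owned by this warp.
  sumReduceWarp<T, 6>(sdata, tid);
  __syncthreads();

  // lane 0 of each warp now contains the sum of two warps' worth of values
  if (lane == 0) sdata[warp] = sdata[tid];
  __syncthreads();

  if (warp == 0) {
    // Assumption: level 4 folds 16 per-warp sums, i.e. blockDim.x == 512;
    // in general the level should be log2(blockDim.x / 32).
    sumReduceWarp<T, 4>(sdata, threadIdx.x);
    if (lane == 0) *output = sdata[0];
  }
}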
This should be a bit faster because it uses all the warps that are launched in the first stage and has no branching within the last stage, at the cost of an extra branch, an extra shared load/store, and an extra __syncthreads() in the new middle stage. I haven't tested this code; if you run it, let me know how it performs. If you use a template for the blockDim in your original code, it may again be faster, but I think this code is more succinct.
Note that the temporary variable t is used because Fermi and later architectures use a pure load/store architecture, so += from shared memory to shared memory results in an extra load (since the sdata pointer must be volatile). Explicitly loading into the temporary once avoids this. On G80 it won't make a difference to performance.
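
The difference between the two forms can be sketched side by side. The helper names foldCompound and foldTemporary are hypothetical, only two of the six steps are shown, and T is assumed to be a built-in arithmetic type. Both follow the same warp-synchronous assumption as the code above.

```cuda
// Compound assignment: because sdata is volatile, each += has to reload
// sdata[tid] from shared memory, load the partner element, and store the
// result back (two loads and one store per step).
template <typename T>
__device__ void foldCompound(volatile T *sdata, const unsigned int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
}

// Register temporary: the running sum stays in t, so after the initial
// load each step needs only one load (the partner element) and one store.
template <typename T>
__device__ void foldTemporary(volatile T *sdata, const unsigned int tid)
{
    T t = sdata[tid];
    sdata[tid] = t = t + sdata[tid + 32];
    sdata[tid] = t = t + sdata[tid + 16];
}
```

This only illustrates the memory traffic. On GPUs with independent thread scheduling (Volta and later), warp-synchronous code of this kind would also need a __syncwarp() between the read and the write of each step.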