I just started in CUDA. Now I have a question. I have N*N matrix, and a window scale is 8x8. I want subdivided this matrix into multiple sub-matrix and find max value of thi
With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.
For large window size, I don't think it is a problem. You could use one thread block to reduce one window. For example if you have a 16x16 window, you could simply use 16x16 thread block. If you have even larger window size, for example 64x64, you could still use 16x16 thread block. First reduce the 64x64 window to 16x16 elements during data loading, then reduce to 1 scalar within the thread block.
For window size smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for 16 4x4 windows. But this will not be optimal because each 32-thread wrap is organized in two parts (2x16). This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.
Alternatively I would suggest you use 1-D thread block with 256 threads. Every m
threads reduce one m
xm
window. Then you could use 2-D grid to cover the whole image.
const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);
In the kernel function, you could
m
xm
window to a 1xm
vector during global data loading;m
vector to a scalar.This following code is a conceptual demo which works when m
is a power of 2 and m <= 32
. You could further modify it for arbitrary m
and better boundary checking.
#include
#include
#include
__global__ void calculate_emax_kernel(const float* input, float* output,
int height, int width, int win_size,
int out_width) {
const int tid = threadIdx.x;
const int i = blockIdx.y * win_size;
const int j = blockIdx.x * 256 + tid;
const int win_id = j % win_size;
__shared__ float smax[256];
float tmax = -1e20;
if (j < width) {
for (int tile = 0; tile < win_size; tile++) {
if (i + tile < height) {
tmax = max(tmax, input[(i + tile) * width + j]);
}
}
}
smax[tid] = tmax;
for (int shift = win_size / 2; shift > 0; shift /= 2) {
if (win_id < shift) {
smax[tid] = max(smax[tid], smax[tid + shift]);
}
}
if (win_id == 0 && j < width) {
output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
}
}
int main() {
const int height = 1024;
const int width = 1024;
const int m = 4;
thrust::device_vector in(height * width);
thrust::device_vector out(
((height + m - 1) / m) * ((width + m - 1) / m));
dim3 blocksize(256);
dim3 gridsize((width + 255) / 256, (height + m - 1) / m);
assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
calculate_emax_kernel<<>>(
thrust::raw_pointer_cast(in.data()),
thrust::raw_pointer_cast(out.data()),
height, width, m, (width + m - 1) / m);
return 0;
}