I am working on a CUDA tutorial that converts an RGBA picture to greyscale, but I couldn't figure out why changing the blockSize and gridSize makes a X
Neither of those grid/block configurations is recommended. The first one does not scale: the number of threads per block is limited on every GPU, so it will eventually fail for larger images. The second one is a poor choice because it runs only 1 thread per block, which leaves GPU occupancy very low; you can verify this with the CUDA Occupancy Calculator included in the CUDA Toolkit. The recommended block size is a multiple of the warp size, which is 32 on NVIDIA GPUs.
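If your toolkit is recent enough, you can also ask the runtime for an occupancy-maximizing block size instead of using the spreadsheet. A minimal sketch using cudaOccupancyMaxPotentialBlockSize (available since CUDA 6.5; the kernel here is just a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void placeholderKernel(float *out) { } // stand-in for your real kernel

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a (1D) block size that maximizes occupancy for this kernel
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, placeholderKernel, 0, 0);
    printf("Suggested threads per block: %d\n", blockSize);
    return 0;
}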
A general and scalable way to pick the 2D grid and block sizes in your case would be something like this:
const dim3 blockSize(16, 16, 1);  // 256 threads per block, a multiple of the warp size
const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,  // round up so the grid
                    (numRows + blockSize.y - 1) / blockSize.y,  // covers the image edges
                    1);
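With these values, the launch itself is the usual triple-chevron call (d_rgbaImage and d_greyImage are hypothetical device pointer names, and rgba_to_greyscale stands in for whatever your kernel is called):

rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);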
You can change the block size from 16 x 16 to any size you like, provided you stay within the limits of the device: a maximum of 512 threads per block is allowed on devices of compute capability 1.0 to 1.3, and 1024 threads per block from compute capability 2.0 onward.
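Rather than hard-coding those limits, you can query them at runtime with cudaGetDeviceProperties; a self-contained sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // properties of device 0
    printf("Compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}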
Now that the grid and block are two-dimensional, the indexing inside the kernel has to change accordingly:
int i = blockIdx.x * blockDim.x + threadIdx.x;  // column index
int j = blockIdx.y * blockDim.y + threadIdx.y;  // row index

// Bounds check: threads that fall outside the image must not read or write
if (i >= numCols || j >= numRows) return;

int idx = j * numCols + i;  // flattened offset into the image

// Standard luminance weights for the R, G and B channels
float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
greyImage[idx] = channelSum;
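Putting it all together, a complete sketch could look like the one below. The uchar4 input / unsigned char output types and the rgba_to_greyscale name are assumptions based on the usual version of this tutorial, and image loading/saving plus error checking are omitted for brevity:

#include <cuda_runtime.h>

__global__ void rgba_to_greyscale(const uchar4 *rgbaImage, unsigned char *greyImage,
                                  int numRows, int numCols)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // column
    int j = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (i >= numCols || j >= numRows) return;      // skip threads outside the image

    int idx = j * numCols + i;
    float channelSum = .299f * rgbaImage[idx].x
                     + .587f * rgbaImage[idx].y
                     + .114f * rgbaImage[idx].z;
    greyImage[idx] = (unsigned char)channelSum;
}

void convertToGreyscale(const uchar4 *h_rgba, unsigned char *h_grey, int numRows, int numCols)
{
    size_t numPixels = (size_t)numRows * numCols;

    uchar4 *d_rgba = NULL;
    unsigned char *d_grey = NULL;
    cudaMalloc(&d_rgba, numPixels * sizeof(uchar4));
    cudaMalloc(&d_grey, numPixels * sizeof(unsigned char));
    cudaMemcpy(d_rgba, h_rgba, numPixels * sizeof(uchar4), cudaMemcpyHostToDevice);

    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgba, d_grey, numRows, numCols);
    cudaDeviceSynchronize(); // wait for the kernel to finish before copying back

    cudaMemcpy(h_grey, d_grey, numPixels * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cudaFree(d_rgba);
    cudaFree(d_grey);
}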