Why does changing the block and grid sizes have such great impact on runtime?

轮回少年 2021-01-24 18:44

I am working on a CUDA tutorial that converts an RGBA picture to greyscale, but I couldn't figure out why changing the blockSize and gridSize makes such a large difference in runtime.

1 Answer
  •  伪装坚强ぢ
    2021-01-24 19:18

    Neither grid/block configuration is recommended. The first one is not scalable because the number of threads per block is limited on the GPU, so it will eventually fail for larger image sizes. The second one is a poor choice because there is only 1 thread per block, which means GPU occupancy would be very low. You can verify this with the GPU Occupancy Calculator included with the CUDA Toolkit. The recommended block size is a multiple of the GPU warp size (32 on current NVIDIA GPUs).

    A general and scalable approach for the 2D grid and block sizes in your case would be something like this:

    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x, (numRows + blockSize.y - 1) / blockSize.y, 1);
    

    You can change the block size from 16 x 16 to any size you like, provided you stay within the limits of the device. A maximum of 512 threads per block is allowed on devices of compute capability 1.0 to 1.3; for devices of compute capability 2.0 onward, the limit is 1024 threads per block.
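
    Rather than hard-coding these limits, you can also query them from the device at runtime. Here is a minimal host-side sketch (not from the original answer) using the standard CUDA runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // properties of device 0

        printf("warp size:             %d\n", prop.warpSize);
        printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("max block dimensions:  %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        return 0;
    }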

    Since the grid and block are now two-dimensional, the indexing inside the kernel would be modified as follows:

    int i = blockIdx.x * blockDim.x + threadIdx.x; //Column
    int j = blockIdx.y * blockDim.y + threadIdx.y; //Row
    
    int idx = j * numCols + i;
    
    //Don't forget to perform bound checks
    if(i>=numCols || j>=numRows) return;
    
    float channelSum = .299f * rgbaImage[idx].x + .587f * rgbaImage[idx].y + .114f * rgbaImage[idx].z;
    greyImage[idx] = channelSum;
    
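    Putting it together, a full kernel plus launch might look like the sketch below. It assumes the RGBA image is stored as uchar4 and the output as unsigned char (the usual form of this exercise), and that d_rgbaImage and d_greyImage are device pointers you have already allocated and copied to; adjust the types and names to match your code.

    __global__ void rgbaToGreyscale(const uchar4* const rgbaImage,
                                    unsigned char* const greyImage,
                                    int numRows, int numCols)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x; // column
        int j = blockIdx.y * blockDim.y + threadIdx.y; // row
        if (i >= numCols || j >= numRows) return;      // bounds check

        int idx = j * numCols + i;
        float channelSum = .299f * rgbaImage[idx].x
                         + .587f * rgbaImage[idx].y
                         + .114f * rgbaImage[idx].z;
        greyImage[idx] = static_cast<unsigned char>(channelSum);
    }

    // Host side: launch with the 2D configuration from above
    // (d_rgbaImage and d_greyImage are assumed device pointers)
    const dim3 blockSize(16, 16, 1);
    const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
                        (numRows + blockSize.y - 1) / blockSize.y, 1);
    rgbaToGreyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
    cudaDeviceSynchronize(); // wait for the kernel to finish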
