dim3 block(4, 2)
dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y);
I found this code in Professional CUDA C Programming on page 53. It\'s mean
This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as ceil(nx/block.x)
, that is, figure out how many blocks are needed to cover the desired size, then round up.
But full floating point division and ceil is more expensive than necessary. Instead, since C defines integer division as a "floor" operation, you can add the divisor - 1 before dividing to the get the effect of a "ceiling" operation.
Try a few examples: If nx = 10
, then nx + block.x - 1
is 13, and by integer divison, you need 3 blocks of size 4.
As you noted in the comment, +block.x pushes up floor to ceiling and the -1 is for numbers that divide perfectly into the divisor. e.g. (12 + 4)/4 would be 4 when we actually want (12+4-1)/4 which 3