I get what blockDim
is, but I have a problem with gridDim. Blockdim
gives the size of the block, but what is gridDim
? On the Internet
In this source code, we even have 4 threds, the kernel function can access all of 10 arrays. How?
#define N 10 //(33*1024)
__global__ void add(int *c){
int tid = threadIdx.x + blockIdx.x * gridDim.x;
if(tid < N)
c[tid] = 1;
while( tid < N)
{
c[tid] = 1;
tid += blockDim.x * gridDim.x;
}
}
int main(void)
{
int c[N];
int *dev_c;
cudaMalloc( (void**)&dev_c, N*sizeof(int) );
for(int i=0; i<N; ++i)
{
c[i] = -1;
}
cudaMemcpy(dev_c, c, N*sizeof(int), cudaMemcpyHostToDevice);
add<<< 2, 2>>>(dev_c);
cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost );
for(int i=0; i< N; ++i)
{
printf("c[%d] = %d \n" ,i, c[i] );
}
cudaFree( dev_c );
}
Why we do not create 10 threads ex) add<<<2,5>>> or add<5,2>>>
Because we have to create reasonably small number of threads, if N is larger than 10 ex) 33*1024.
This source code is example of this case. arrays are 10, cuda threads are 4. How to access all 10 arrays only by 4 threads.
see the page about meaning of threadIdx, blockIdx, blockDim, gridDim in the cuda detail.
In this source code,
gridDim.x : 2 this means number of block of x
gridDim.y : 1 this means number of block of y
blockDim.x : 2 this means number of thread of x in a block
blockDim.y : 1 this means number of thread of y in a block
Our number of thread are 4, because 2*2(blocks * thread).
In add kernel function, we can access 0, 1, 2, 3 index of thread
->tid = threadIdx.x + blockIdx.x * blockDim.x
①0+0*2=0
②1+0*2=1
③0+1*2=2
④1+1*2=3
How to access rest of index 4, 5, 6, 7, 8, 9. There is a calculation in while loop
tid += blockDim.x + gridDim.x in while
** first call of kernel **
-1 loop: 0+2*2=4
-2 loop: 4+2*2=8
-3 loop: 8+2*2=12 ( but this value is false, while out!)
** second call of kernel **
-1 loop: 1+2*2=5
-2 loop: 5+2*2=9
-3 loop: 9+2*2=13 ( but this value is false, while out!)
** third call of kernel **
-1 loop: 2+2*2=6
-2 loop: 6+2*2=10 ( but this value is false, while out!)
** fourth call of kernel **
-1 loop: 3+2*2=7
-2 loop: 7+2*2=11 ( but this value is false, while out!)
So, all index of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 can access by tid value.
refer to this page. http://study.marearts.com/2015/03/to-process-all-arrays-by-reasonably.html I cannot upload image, because low reputation.
First, see this figure Grid of thread blocks from the CUDA official document
Usually, we use kernel as this:
__global__ void kernelname(...){
const id_x = blockDim.x * blockIdx.x + threadIdx.x;
const id_y = blockDim.y * blockIdx.y + threadIdx.y;
...
}
// invoke kernel
// assume we have assigned the proper gridsize and blocksize
kernelname<<<gridsize, blocksize>>>(...)
The meaning of some variables:
gridsize
the number of blocks per grid, corresponding to the gridDim
blocksize
the number of threads per block, corresponding to the blockDim
threadIdx.x
varies in [0, blockDim.x)
blockIdx.x
varies in [0, gridDim.x)
So, let's try to calculate the index at the x direction when we have threadIdx.x
and blockIdx.x
. According to the figure, blockIdx.x
determines which block you are, and threadIdx.x
determines which thread you are when the location of block is given. Hence, we have:
which_blk = blockDim.x * blockIdx.x; // which block you are
final_index_x = which_blk + threadIdx.x; // based on the given block, we can have the final location by adding the threadIdx.x
that is:
final_index_x = blockDim.x * blockIdx.x + threadIdx.x;
which is same as the sample code above.
Similarly, we can get the index at y or z direction respectively.
As we can see, we usually don't use gridDim
in our code, because this information is performed as the range of blockIdx
. On the contrary, we have to use blockDim
although this information is performed as the range of threadIdx
. The reason I have shown above step by step.
I hope this answer may help resolve your confusion.
Paraphrased from the CUDA Programming Guide:
gridDim: This variable contains the dimensions of the grid.
blockIdx: This variable contains the block index within the grid.
blockDim: This variable and contains the dimensions of the block.
threadIdx: This variable contains the thread index within the block.
You seem to be a bit confused about the thread hierachy that CUDA has; in a nutshell, for a kernel there will be 1 grid, (which I always visualize as a 3-dimensional cube). Each of its elements is a block, such that a grid declared as dim3 grid(10, 10, 2);
would have 10*10*2 total blocks. In turn, each block is a 3-dimensional cube of threads.
With that said, it's common to only use the x-dimension of the blocks and grids, which is what it looks like the code in your question is doing. This is especially revlevant if you're working with 1D arrays. In that case, your tid+=blockDim.x * gridDim.x
line would in effect be the unique index of each thread within your grid. This is because your blockDim.x
would be the size of each block, and your gridDim.x
would be the total number of blocks.
So if you launch a kernel with parameters
dim3 block_dim(128,1,1);
dim3 grid_dim(10,1,1);
kernel<<<grid_dim,block_dim>>>(...);
then in your kernel had threadIdx.x + blockIdx.x*blockDim.x
you would effectively have:
threadIdx.x range from [0 ~ 128)
blockIdx.x range from [0 ~ 10)
blockDim.x equal to 128
gridDim.x equal to 10
Hence in calculating threadIdx.x + blockIdx.x*blockDim.x
, you would have values within the range defined by: [0, 128) + 128 * [1, 10)
, which would mean your tid values would range from {0, 1, 2, ..., 1279}.
This is useful for when you want to map threads to tasks, as this provides a unique identifier for all of your threads in your kernel.
However, if you have
int tid = threadIdx.x + blockIdx.x * blockDim.x;
tid += blockDim.x * gridDim.x;
then you'll essentially have: tid = [0, 128) + 128 * [1, 10) + (128 * 10)
, and your tid values would range from {1280, 1281, ..., 2559}
I'm not sure where that would be relevant, but it all depends on your application and how you map your threads to your data. This mapping is pretty central to any kernel launch, and you're the one who determines how it should be done. When you launch your kernel you specify the grid and block dimensions, and you're the one who has to enforce the mapping to your data inside your kernel. As long as you don't exceed your hardware limits (for modern cards, you can have a maximum of 2^10 threads per block and 2^16 - 1 blocks per grid)
blockDim.x,y,z
gives the number of threads in a block, in the
particular directiongridDim.x,y,z
gives the number of blocks in a grid, in the
particular directionblockDim.x * gridDim.x
gives the number of threads in a grid (in the x direction, in this case)block and grid variables can be 1, 2, or 3 dimensional. It's common practice when handling 1-D data to only create 1-D blocks and grids.
In the CUDA documentation, these variables are defined here
In particular, when the total threads in the x-dimension (gridDim.x*blockDim.x
) is less than the size of the array I wish to process, then it's common practice to create a loop and have the grid of threads move through the entire array. In this case, after processing one loop iteration, each thread must then move to the next unprocessed location, which is given by tid+=blockDim.x*gridDim.x;
In effect, the entire grid of threads is jumping through the 1-D array of data, a grid-width at a time. This topic, sometimes called a "grid-striding loop", is further discussed in this blog article.
You might want to consider taking a couple of the introductory CUDA webinars available on the NVIDIA webinar page. For example, these 2:
It would be 2 hours well spent, if you want to understand these concepts better.
The general topic of grid-striding loops is covered in some detail here.