I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.
Currently if I have the
You could keep the outer loop unchanged. It is also better to map the innermost loop to .x, so that neighboring threads access adjacent global memory locations and the accesses can be coalesced.
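__global__ void kernelExample() {
    // A, a, b, c, d and the strides x, y, z are assumed to be visible here,
    // e.g. emitted by the compiler as constants or globals.
    int _cu_x = blockIdx.x * blockDim.x + threadIdx.x;  // innermost loop (l)
    int _cu_y = blockIdx.y * blockDim.y + threadIdx.y;  // k
    int _cu_z = blockIdx.z * blockDim.z + threadIdx.z;  // j
    // Skip threads outside the loop bounds when the grid overshoots.
    if (_cu_x >= d || _cu_y >= c || _cu_z >= b) return;
    for (int i = 0; i < a; i++) {  // outer loop stays sequential
        A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
    }
}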
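To see why .x should drive the innermost index: threads in a warp have consecutive .x indices, so what matters is how far apart their addresses are. The sketch below is plain host-side C++ that only reproduces the offset arithmetic of the kernel; the function names are made up for illustration.

```cpp
#include <cassert>

// Element offset used by the kernel above: A[i*x*y*z + j*y*z + k*z + l].
inline long long offset(int i, int j, int k, int l, int x, int y, int z) {
    assert(x > 0 && y > 0 && z > 0);
    return ((static_cast<long long>(i) * x + j) * y + k) * z + l;
}

// Distance in elements between the addresses touched by two threads whose
// .x index differs by one, for two hypothetical mappings of .x.
inline long long stride_if_x_drives_l(int x, int y, int z) {
    // .x mapped to the innermost index l
    return offset(0, 0, 0, 1, x, y, z) - offset(0, 0, 0, 0, x, y, z);
}
inline long long stride_if_x_drives_k(int x, int y, int z) {
    // .x mapped to the next index out, k
    return offset(0, 0, 1, 0, x, y, z) - offset(0, 0, 0, 0, x, y, z);
}
```

With .x on l the stride is one element, so a warp touches a contiguous run of memory. With .x on k the stride is z elements, and the same warp's accesses are spread out.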
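However, if a, b, c, and d are all very small, parallelizing only the three inner loops may not give you enough threads. In that case you could parallelize all four loops by launching one thread per iteration and converting its linear index into n-D indices.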
__global__ void kernelExample() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= a*b*c*d) return;  // guard threads past the last iteration
    int i = tid / (b*c*d);
    int j = tid / (c*d) % b;
    int k = tid / d % c;
    int l = tid % d;
    A[i*x*y*z + j*y*z + k*z + l] = 1;
}
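If you want to sanity-check the index decomposition before generating kernels from it, it can be pulled into a small helper that compiles both as plain C++ and with nvcc. This is only a sketch: `Index4`, `unflatten`, `flatten`, `round_trip_ok`, and the `HOST_DEVICE` macro are names I made up. It peels off one index at a time, which gives the same i, j, k, l as the formulas in the kernel.

```cpp
#include <cassert>

// Lets the same helpers build as plain C++ (host-side checks) or with nvcc.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical container for the four loop indices.
struct Index4 { int i, j, k, l; };

// Split a linear id in [0, a*b*c*d) into (i, j, k, l), with l varying fastest.
HOST_DEVICE inline Index4 unflatten(int tid, int b, int c, int d) {
    assert(tid >= 0 && b > 0 && c > 0 && d > 0);
    Index4 r;
    r.l = tid % d;  tid /= d;
    r.k = tid % c;  tid /= c;
    r.j = tid % b;  tid /= b;
    r.i = tid;
    return r;
}

// Inverse mapping, used here only to check the decomposition.
HOST_DEVICE inline int flatten(Index4 r, int b, int c, int d) {
    return ((r.i * b + r.j) * c + r.k) * d + r.l;
}

// Host-side check: every id in the iteration space must survive a round trip.
inline bool round_trip_ok(int a, int b, int c, int d) {
    for (int tid = 0; tid < a * b * c * d; ++tid)
        if (flatten(unflatten(tid, b, c, d), b, c, d) != tid) return false;
    return true;
}
```

For example, with b = 3, c = 4, d = 5, thread 37 lands on (i, j, k, l) = (0, 1, 3, 2).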
But be careful: calculating i, j, k, l this way may introduce a lot of overhead, as integer division and modulo are slow on GPUs. As an alternative, you could map i and j to .z and .y, and calculate only k and l (and any further dimensions) from .x in a similar way.
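A rough sketch of that hybrid mapping is below. I am assuming here that A and the extents are passed as kernel arguments. The names `hybrid_offset` and `kernelHybrid` and the launch shape in the comment are made up for illustration. The offset helper also builds as plain C++ so the arithmetic can be checked on the host; the kernel itself is only compiled under nvcc.

```cpp
#include <cassert>

#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Offset into A for the hybrid mapping: i and j come straight from the .z and
// .y coordinates, and only (k, l) is unpacked from the linear .x id.
// x, y, z are the array strides used in the kernels above.
HOST_DEVICE inline long long hybrid_offset(int i, int j, int tid_x,
                                           int d, int x, int y, int z) {
    assert(d > 0);
    int k = tid_x / d;  // one division ...
    int l = tid_x % d;  // ... and one modulo per thread
    return ((static_cast<long long>(i) * x + j) * y + k) * z + l;
}

#ifdef __CUDACC__
// Assumed launch shape: grid = (ceil(c*d / blockDim.x), b, a),
// block = (blockDim.x, 1, 1).
__global__ void kernelHybrid(int *A, int a, int b, int c, int d,
                             int x, int y, int z) {
    int tid_x = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    if (i >= a || j >= b || tid_x >= c * d) return;  // guard partial blocks
    A[hybrid_offset(i, j, tid_x, d, x, y, z)] = 1;
}
#endif
```

Compared with the fully linearized version, each thread does one division and one modulo instead of three of each, and neighboring .x threads still write to adjacent elements.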