Parallelize four and more nested loops with CUDA


I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.

Currently if I have the

1 Answer

滥情空心

2021-01-01 09:18
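You could keep the outermost loop as it is and map the three inner loops to the thread grid. It is also better to map the innermost loop to .x, so that consecutive threads access consecutive global memory addresses and the accesses coalesce.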
```
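// Assumes A is an int array of extents [a][x][y][z], passed in as parameters.
__global__ void kernelExample(int *A, int a, int x, int y, int z) {
    // One thread per point of the three inner loops; .x is the innermost one.
    int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);
    int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);
    int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);
    // Skip threads outside the array when the grid overshoots the extents.
    if (_cu_x >= z || _cu_y >= y || _cu_z >= x) return;
    // The outermost loop stays sequential inside each thread.
    for(int i = 0; i < a; i++) {
        A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
    }
}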
```
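To launch a kernel with this mapping, the grid has to cover the three inner extents, rounded up to whole blocks. The following is only a minimal sketch under stated assumptions: the names ceilDiv, fillKernel and launchFill, the int element type, and the 32x4x2 block shape are all hypothetical and chosen for illustration. The __CUDACC__ guard just lets the rounding helper also compile as plain C++ so it can be checked on the host.

```cpp
#include <cassert>

// Number of blocks needed to cover `extent` elements with `blockSize` threads.
// Plain integer arithmetic, so it works in host code with or without nvcc.
inline unsigned ceilDiv(int extent, int blockSize) {
    assert(extent > 0 && blockSize > 0);
    return static_cast<unsigned>((extent + blockSize - 1) / blockSize);
}

#ifdef __CUDACC__
// Hypothetical kernel: A is assumed to be an int array of extents [a][x][y][z].
__global__ void fillKernel(int *A, int a, int x, int y, int z) {
    int cx = blockIdx.x * blockDim.x + threadIdx.x;  // innermost extent z
    int cy = blockIdx.y * blockDim.y + threadIdx.y;  // extent y
    int cz = blockIdx.z * blockDim.z + threadIdx.z;  // extent x
    if (cx >= z || cy >= y || cz >= x) return;       // grid may overshoot
    for (int i = 0; i < a; i++) {                    // outermost loop stays sequential
        A[i * x * y * z + cz * y * z + cy * z + cx] = 1;
    }
}

// Hypothetical host helper; d_A must point to a*x*y*z ints in device memory.
void launchFill(int *d_A, int a, int x, int y, int z) {
    dim3 block(32, 4, 2);  // 256 threads per block; assumed shape, tune per device
    dim3 grid(ceilDiv(z, block.x), ceilDiv(y, block.y), ceilDiv(x, block.z));
    fillKernel<<<grid, block>>>(d_A, a, x, y, z);
}
#endif
```

Making the block widest in .x keeps the threads of a warp on neighbouring addresses of the innermost dimension. Note that the .y and .z grid dimensions are limited to 65535 blocks, so very large extents belong in .x.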
However, if a, b, c, and d are all very small, this mapping may not expose enough parallelism. In that case you could launch a 1-D grid and convert each thread's linear index into n-D indices.
```
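// Assumes A is an int array with inner extents x, y, z; a, b, c, d are the loop bounds.
__global__ void kernelExample(int *A, int a, int b, int c, int d, int x, int y, int z) {
    int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
    // Skip surplus threads in the last block.
    if (tid >= a*b*c*d) return;
    // Decompose the linear index into (i, j, k, l), with l varying fastest.
    int i = tid / (b*c*d);
    int j = tid / (c*d) % b;
    int k = tid / d % c;
    int l = tid % d;

    A[i*x*y*z + j*y*z + k*z + l] = 1;
}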
```
Be careful, though: computing i, j, k, and l may introduce significant overhead, because integer division and modulo are slow on GPUs. As an alternative, you could map i and j to .z and .y, and compute only k, l, and any further dimensions from .x in the same way.
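That alternative could look roughly like the sketch below, which takes i and j directly from .z and .y and spends a single division and modulo on splitting .x into k and l. This is not the answerer's code: the names HD, offset4, splitKL, hybridKernel and launchHybrid, the int element type, the 64x2x2 block shape, and the assumption that the loop bounds b, c, d fit inside the array extents x, y, z are all mine. The __CUDACC__ guards only let the index helpers compile as plain C++ as well, so the arithmetic can be checked on the host.

```cpp
#include <cassert>

// Mark helpers as callable from both host and device when compiled by nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Offset of element (i, j, k, l) in an array whose inner extents are x, y, z.
// Horner form of i*x*y*z + j*y*z + k*z + l.
HD inline int offset4(int i, int j, int k, int l, int x, int y, int z) {
    return ((i * x + j) * y + k) * z + l;
}

// Split a linear index t over the two innermost loops; d is the innermost bound.
HD inline void splitKL(int t, int d, int *k, int *l) {
    *k = t / d;
    *l = t % d;
}

#ifdef __CUDACC__
// Hypothetical kernel: only k and l are recovered by division.
__global__ void hybridKernel(int *A, int a, int b, int c, int d, int x, int y, int z) {
    int i = blockIdx.z * blockDim.z + threadIdx.z;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= a || j >= b || t >= c * d) return;  // grid may overshoot
    int k, l;
    splitKL(t, d, &k, &l);
    A[offset4(i, j, k, l, x, y, z)] = 1;
}

// Hypothetical host helper; d_A must point to a*x*y*z ints in device memory.
void launchHybrid(int *d_A, int a, int b, int c, int d, int x, int y, int z) {
    assert(b <= x && c <= y && d <= z);  // assumed: loop bounds fit the array
    dim3 block(64, 2, 2);                // 256 threads per block; assumed shape
    dim3 grid((c * d + block.x - 1) / block.x,
              (b + block.y - 1) / block.y,
              (a + block.z - 1) / block.z);
    hybridKernel<<<grid, block>>>(d_A, a, b, c, d, x, y, z);
}
#endif
```

Because l still varies fastest along .x, neighbouring threads keep writing to neighbouring addresses, while the per-thread cost drops from several divisions to one division and one modulo.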