Parallelize four and more nested loops with CUDA

深忆病人 2021-01-01 08:47

I am working on a compiler generating parallel C++ code. I am new to CUDA programming but I am trying to parallelize the C++ code with CUDA.

Currently if I have the

1 Answer
  • 2021-01-01 09:18

    You could keep the outer loop unchanged and parallelize only the three inner loops. It is also better to map .x to the innermost loop, so that adjacent threads touch adjacent addresses and global memory accesses are coalesced.

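    // Threads cover the three inner loops (j, k, l); the outer i loop stays serial.
    // Assumptions: A holds int, and b, c, d are the bounds of the j, k, l loops.
    __global__ void kernelExample(int *A, int a, int b, int c, int d,
                                  int x, int y, int z) {
        int _cu_x = ((blockIdx.x*blockDim.x)+threadIdx.x);  // l, fastest-varying
        int _cu_y = ((blockIdx.y*blockDim.y)+threadIdx.y);  // k
        int _cu_z = ((blockIdx.z*blockDim.z)+threadIdx.z);  // j
        if (_cu_x >= d || _cu_y >= c || _cu_z >= b) return;  // guard partial blocks
        for(int i = 0; i < a; i++) {
            A[i*x*y*z + _cu_z*y*z + _cu_y*z + _cu_x] = 1;
        }
    }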
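    To make the ".x as innermost loop" advice concrete, here is a minimal sketch of how the grid for the kernel above might be sized. It is not part of the original answer: `Extent3`, `ceilDiv`, `gridFor` and the 32x4x2 block shape are hypothetical, and b, c, d are assumed to be the bounds of the three inner loops. `Extent3` is a plain stand-in for CUDA's `dim3`, so the arithmetic can be checked with an ordinary C++ compiler.

```cpp
// Sketch only: helper names and the block shape are assumptions.
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3.
struct Extent3 { unsigned x, y, z; };

// Ceiling division: blocks of size `block` needed to cover `n` iterations.
inline unsigned ceilDiv(unsigned n, unsigned block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Grid shape for a kernel whose .x covers the innermost loop (extent d),
// .y the next loop out (extent c) and .z the one after that (extent b).
// The outermost loop (extent a) stays inside the kernel.
inline Extent3 gridFor(unsigned b, unsigned c, unsigned d, Extent3 block) {
    return Extent3{ ceilDiv(d, block.x), ceilDiv(c, block.y), ceilDiv(b, block.z) };
}

// In a .cu file this would be used roughly as:
//   Extent3 blk{32, 4, 2};              // 256 threads, .x one warp wide
//   Extent3 g = gridFor(b, c, d, blk);
//   kernelExample<<<dim3(g.x, g.y, g.z), dim3(blk.x, blk.y, blk.z)>>>(/* kernel arguments */);
```

    Because the grid is rounded up, some threads fall outside the loop bounds, so the kernel should return early for any thread whose index is past b, c or d.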
    

    However, if a, b, c and d are all very small, the three inner loops alone may not expose enough parallelism. In that case you could launch one thread per element and convert its linear index into n-D indices.

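    // One thread per (i, j, k, l) element; the int element type of A is an assumption.
    __global__ void kernelExample(int *A, int a, int b, int c, int d,
                                  int x, int y, int z) {
        int tid = ((blockIdx.x*blockDim.x)+threadIdx.x);
        if (tid >= a*b*c*d) return;   // guard threads past the last element
        int i = tid / (b*c*d);
        int j = tid / (c*d) % b;
        int k = tid / d % c;
        int l = tid % d;

        A[i*x*y*z + j*y*z + k*z + l] = 1;
    }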
    

    But be careful: calculating i, j, k, l may introduce considerable overhead, because integer division and modulo are slow on the GPU. As an alternative, you could map i and j to .z and .y, and compute only k, l (and any further dimensions) from .x in a similar way.
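    The alternative described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the poster's code: `kernelHybrid`, `splitKL`, `offset4`, the `HD` macro, the `int` element type and the launch shape (one block row per i and per j) are all hypothetical. The kernel is guarded by `__CUDACC__` so the index helpers can also be compiled and checked as plain C++.

```cpp
// Sketch only: names, element type and launch shape are assumptions.
#include <cassert>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD                      // plain C++ build: no CUDA qualifiers
#endif

struct KL { int k; int l; };

// Split a linear index in [0, c*d) into (k, l), with l fastest-varying.
HD inline KL splitKL(int kl, int d) {
    assert(d > 0);
    KL r;
    r.k = kl / d;
    r.l = kl % d;
    return r;
}

// Flat offset equal to i*x*y*z + j*y*z + k*z + l, evaluated in Horner form.
HD inline long long offset4(int i, int j, int k, int l, int x, int y, int z) {
    return ((static_cast<long long>(i) * x + j) * y + k) * z + l;
}

#ifdef __CUDACC__
// Assumed launch: grid = (ceil(c*d / blockDim.x), b, a), block = (blockDim.x, 1, 1).
__global__ void kernelHybrid(int *A, int a, int b, int c, int d,
                             int x, int y, int z) {
    int i  = blockIdx.z;                             // taken directly, no division
    int j  = blockIdx.y;                             // taken directly, no division
    int kl = blockIdx.x * blockDim.x + threadIdx.x;  // only this part is decomposed
    if (i >= a || j >= b || kl >= c * d) return;     // guard partial blocks
    KL s = splitKL(kl, d);
    A[offset4(i, j, s.k, s.l, x, y, z)] = 1;
}
#endif
```

    Only one division and one modulo remain per thread. The trade-off is that gridDim.y and gridDim.z are limited to 65535, so this shape only works while a and b stay below that limit.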
