I want to port my C code to CUDA. The main computational part contains three nested for loops:
for (int i=0; i< Nx;i++){ for (int j=0;j
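There are many ways to do it. One of them is to spread the outer loop over blocks and the inner loops over the threads of each block, with every index striding by the grid or block dimension: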
for (int i=blockIdx.x; i< Nx; i += gridDim.x){ for (int j=threadIdx.y; j
You would launch the kernel above like this:
// nx, ny: block dimensions
kernel<<<dim3(nBlocks), dim3(nx, ny)>>>(...);
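For reference, here is a minimal sketch of how the strided loops and the launch might fit together. The loop body in the question is cut off, so everything beyond the `i` and `j` loops is an assumption: the `k` loop over `threadIdx.x`, the bound names `Ny` and `Nz`, the kernel name `scale3d`, the host helper `run_scale3d`, and the body itself (doubling each element of a flattened array) are all made up for illustration.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical body: out[i][j][k] = 2 * in[i][j][k], arrays flattened row-major.
// i strides over blocks, j over threadIdx.y, k over threadIdx.x (k mapping is assumed).
__global__ void scale3d(const float* in, float* out, int Nx, int Ny, int Nz)
{
    for (int i = blockIdx.x; i < Nx; i += gridDim.x) {
        for (int j = threadIdx.y; j < Ny; j += blockDim.y) {
            for (int k = threadIdx.x; k < Nz; k += blockDim.x) {
                size_t idx = ((size_t)i * Ny + j) * Nz + k;
                out[idx] = 2.0f * in[idx];
            }
        }
    }
}

// Hypothetical host helper: runs the kernel and compares against the CPU result.
// Returns the number of mismatching elements, or -1 on a CUDA error.
int run_scale3d(int Nx, int Ny, int Nz, int nBlocks, int nx, int ny)
{
    size_t n = (size_t)Nx * Ny * Nz;
    size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t t = 0; t < n; ++t) h_in[t] = (float)(t % 97);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // nx, ny: block dimensions (nx * ny must not exceed the per-block thread limit)
    scale3d<<<dim3(nBlocks), dim3(nx, ny)>>>(d_in, d_out, Nx, Ny, Nz);
    cudaError_t err = cudaGetLastError();               // launch configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize(); // execution errors
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (size_t t = 0; t < n; ++t)
        if (h_out[t] != 2.0f * h_in[t]) ++mismatches;
    return mismatches;
}
```

Because every loop strides by the grid or block dimension, the result should not depend on the launch configuration: `nBlocks` may be smaller than `Nx`, and `nx`, `ny` smaller than `Nz`, `Ny`. Calling, say, `run_scale3d(10, 7, 5, 4, 8, 4)` from `main` should return 0 on a working device.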