Determining the optimal value for #pragma unroll N in CUDA
Question

I understand how #pragma unroll works, but suppose I have the following example:

    __global__ void test_kernel(const float* B, const float* C, float* A_out)
    {
        int j = threadIdx.x + blockIdx.x * blockDim.x;
        if (j < array_size) {
            #pragma unroll
            for (int i = 0; i < LIMIT; i++) {
                A_out[i] = B[i] + C[i];
            }
        }
    }

I want to determine the optimal value for LIMIT in the kernel above, which will be launched with x threads and y blocks. LIMIT can be anywhere from 2 to 1<<20. Since 1