What does #pragma unroll do exactly? Does it affect the number of threads?

故里飘歌 2021-01-30 17:12

I'm new to CUDA, and I can't understand loop unrolling. I've written a piece of code to understand the technique:

__global__ void kernel(float *b, int size)
{
         


        
1 Answer
  • 2021-01-30 18:08

    No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.
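
    To make the thread count concrete, here is a minimal sketch. The kernel `countIterations` and the helper `threadsThatRanFullLoop` are hypothetical names of mine, not the code from the question (which is not shown in full). It assumes a launch with one block of 100 threads and size = 100, and has every thread record how many loop iterations it ran.

    ```cuda
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel, not the asker's original (which is not shown above).
    // Every thread runs the whole loop and records how many iterations it made.
    __global__ void countIterations(int *iterations, int size)
    {
        int count = 0;
        #pragma unroll 4  // unrolling hint for this loop only; no effect on threads
        for (int i = 0; i < size; i++)
            count++;
        iterations[threadIdx.x] = count;
    }

    // Launches one block of `threads` threads and returns how many of them
    // reported exactly `size` iterations (-1 if a CUDA call failed).
    int threadsThatRanFullLoop(int threads, int size)
    {
        int *d_iterations = nullptr;
        if (cudaMalloc(&d_iterations, threads * sizeof(int)) != cudaSuccess)
            return -1;
        cudaMemset(d_iterations, 0, threads * sizeof(int));

        // The launch configuration alone decides the thread count: 1 block, `threads` threads.
        countIterations<<<1, threads>>>(d_iterations, size);

        std::vector<int> h_iterations(threads, 0);
        cudaError_t err = cudaMemcpy(h_iterations.data(), d_iterations,
                                     threads * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d_iterations);
        if (err != cudaSuccess)
            return -1;

        int matching = 0;
        for (int n : h_iterations)
            if (n == size) matching++;
        return matching;
    }

    int main()
    {
        printf("threads that ran 100 iterations: %d\n", threadsThatRanFullLoop(100, 100));
        return 0;
    }
    ```

    Changing or removing the pragma should leave the reported count unchanged, because the number of threads comes only from the <<<1, 100>>> launch configuration.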

    #pragma unroll is a compiler directive that requests loop unrolling, an optimization that can, for example, replace a piece of code like

    for ( int i = 0; i < 5; i++ )
        b[i] = i;
    

    with

    b[0] = 0;
    b[1] = 1;
    b[2] = 2;
    b[3] = 3;
    b[4] = 4;
    

    by putting the #pragma unroll directive right before the loop. The advantage of the unrolled version is that it gives the processor less work. The for-loop version must, in addition to assigning each i to b[i], initialize i, evaluate i<5 six times, and increment i five times. The unrolled version only fills in the contents of the b array (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is improved instruction-level parallelism (ILP): the unrolled version can give the processor more operations to push into its pipeline without the loop condition having to be checked on every iteration.
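
    As a rough check that the two forms are interchangeable, the sketch below runs both as kernels and compares the results. The names `fillRolled`, `fillUnrolled` and `kernelsAgree` are made up for this illustration, and each kernel is launched with a single thread to keep it simple.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Loop form: the trip count (5) is a compile-time constant,
    // so the compiler is able to unroll it completely.
    __global__ void fillRolled(float *b)
    {
        #pragma unroll
        for (int i = 0; i < 5; i++)
            b[i] = i;
    }

    // Hand-unrolled form of the same loop.
    __global__ void fillUnrolled(float *b)
    {
        b[0] = 0;
        b[1] = 1;
        b[2] = 2;
        b[3] = 3;
        b[4] = 4;
    }

    // Runs both kernels and returns true if both arrays hold 0, 1, 2, 3, 4.
    bool kernelsAgree()
    {
        const int n = 5;
        float *d_a = nullptr, *d_b = nullptr;
        if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess)
            return false;
        if (cudaMalloc(&d_b, n * sizeof(float)) != cudaSuccess) {
            cudaFree(d_a);
            return false;
        }

        fillRolled<<<1, 1>>>(d_a);
        fillUnrolled<<<1, 1>>>(d_b);

        float h_a[n], h_b[n];
        bool ok = cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost) == cudaSuccess
               && cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; i++)
            ok = (h_a[i] == h_b[i]) && (h_a[i] == static_cast<float>(i));

        cudaFree(d_a);
        cudaFree(d_b);
        return ok;
    }

    int main()
    {
        printf("rolled and unrolled kernels %s\n", kernelsAgree() ? "agree" : "differ");
        return 0;
    }
    ```

    This only shows that the results match. To see whether the compiler really unrolled the loop, one option is to compile with nvcc -ptx and read the generated code.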

    Posts like this explain that runtime loop unrolling cannot happen in CUDA. In your case the CUDA compiler has no way of knowing that size is going to be 100, so compile-time loop unrolling will not occur, and if you force unrolling you may end up hurting performance.
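
    For a loop whose bound is only known at run time, the sketch below shows what "forcing" looks like. As far as I know from the CUDA programming guide, a bare #pragma unroll leaves such a loop rolled, #pragma unroll 4 requests 4-way partial unrolling, and #pragma unroll 1 disables unrolling. The kernel `sumRuntime` and the helper `sumOnDevice` are hypothetical names, and whether the factor helps or hurts has to be measured.

    ```cuda
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical kernel: a single thread sums `size` values, where `size`
    // is only known at run time.
    __global__ void sumRuntime(const float *in, float *out, int size)
    {
        float sum = 0.0f;
        #pragma unroll 4  // request 4-way partial unrolling; the trip count stays dynamic
        for (int i = 0; i < size; i++)
            sum += in[i];
        *out = sum;
    }

    // Sums 0, 1, ..., size-1 on the device; returns -1.0f if a CUDA call failed.
    float sumOnDevice(int size)
    {
        std::vector<float> h_in(size);
        for (int i = 0; i < size; i++)
            h_in[i] = static_cast<float>(i);

        float *d_in = nullptr, *d_out = nullptr;
        if (cudaMalloc(&d_in, size * sizeof(float)) != cudaSuccess)
            return -1.0f;
        if (cudaMalloc(&d_out, sizeof(float)) != cudaSuccess) {
            cudaFree(d_in);
            return -1.0f;
        }
        cudaMemcpy(d_in, h_in.data(), size * sizeof(float), cudaMemcpyHostToDevice);

        sumRuntime<<<1, 1>>>(d_in, d_out, size);

        float result = -1.0f;
        if (cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0f;
        cudaFree(d_in);
        cudaFree(d_out);
        return result;
    }

    int main(int argc, char **argv)
    {
        // The size comes from the command line, so the compiler cannot know it.
        int size = (argc > 1) ? atoi(argv[1]) : 100;
        if (size < 1)
            size = 100;
        printf("sum of 0..%d = %.0f\n", size - 1, sumOnDevice(size));
        return 0;
    }
    ```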

    If you are sure that size is 100 for all executions, you can unroll your loop as follows:

    #pragma unroll
    for (int i = 0; i < SIZE; i++)  // or simply for (int i = 0; i < 100; i++)
        b[i] = i;
    

    where SIZE is known at compile time, for example via #define SIZE 100.
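
    A template parameter is another way to make the bound a compile-time constant without a macro. This is only a sketch of that alternative, not something from the question. `fillFixed` and `fillAndCheck` are hypothetical names, and the kernel is launched with a single thread for simplicity.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical variant: N is a template parameter, so the trip count is a
    // compile-time constant for each instantiation.
    template <int N>
    __global__ void fillFixed(float *b)
    {
        #pragma unroll
        for (int i = 0; i < N; i++)
            b[i] = i;
    }

    // Runs fillFixed<N> with a single thread and checks that b[i] == i for all i.
    template <int N>
    bool fillAndCheck()
    {
        float *d_b = nullptr;
        if (cudaMalloc(&d_b, N * sizeof(float)) != cudaSuccess)
            return false;

        fillFixed<N><<<1, 1>>>(d_b);

        float h_b[N];
        bool ok = cudaMemcpy(h_b, d_b, sizeof(h_b), cudaMemcpyDeviceToHost) == cudaSuccess;
        cudaFree(d_b);
        for (int i = 0; ok && i < N; i++)
            ok = (h_b[i] == static_cast<float>(i));
        return ok;
    }

    int main()
    {
        printf("fillFixed<100> %s\n", fillAndCheck<100>() ? "ok" : "failed");
        return 0;
    }
    ```

    Each distinct N produces its own kernel instantiation, so this suits a small set of known sizes rather than arbitrary run-time values.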

    I also suggest adding proper CUDA error checking to your code (explained here).
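
    One common shape for such checking is a small macro wrapped around every runtime call, sketched below. `CUDA_CHECK`, `fill` and `fillAndReadLast` are hypothetical names chosen for this example and are not part of the CUDA API. The runtime calls used (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are the standard ones.

    ```cuda
    #include <cstdio>
    #include <cstdlib>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical helper macro: print the error and abort if a runtime call fails.
    #define CUDA_CHECK(call)                                               \
        do {                                                               \
            cudaError_t err_ = (call);                                     \
            if (err_ != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                        cudaGetErrorString(err_), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    // Hypothetical kernel: a single thread writes b[i] = i for i < size.
    __global__ void fill(float *b, int size)
    {
        for (int i = 0; i < size; i++)
            b[i] = i;
    }

    // Fills `size` elements on the device and returns the last one.
    float fillAndReadLast(int size)
    {
        float *d_b = nullptr;
        CUDA_CHECK(cudaMalloc(&d_b, size * sizeof(float)));

        fill<<<1, 1>>>(d_b, size);
        CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
        CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

        std::vector<float> h_b(size);
        CUDA_CHECK(cudaMemcpy(h_b.data(), d_b, size * sizeof(float),
                              cudaMemcpyDeviceToHost));
        CUDA_CHECK(cudaFree(d_b));
        return h_b[size - 1];
    }

    int main()
    {
        printf("last element: %.0f\n", fillAndReadLast(100));
        return 0;
    }
    ```

    Kernel launches return no status themselves, which is why the sketch checks cudaGetLastError() right after the launch and then the result of cudaDeviceSynchronize().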
