What does #pragma unroll do exactly? Does it affect the number of threads?

后端未结

关注

 1  639

故里飘歌 2021-01-30 17:12

I\'m new to CUDA, and I can\'t understand loop unrolling. I\'ve written a piece of code to understand the technique

__global__ void kernel(float *b, int size)
{


      
      
        
          1条回答        

        
                    
            
            
                         
                
              
              
                
                   死守一世寂寞
                                             
                
                
                (楼主)
            
              
              
                2021-01-30 18:08
              

            
            
                        
No. It means you have called a CUDA kernel with one block and that one block has 100 active threads. You're passing size as the second function parameter to your kernel. In your kernel each of those 100 threads executes the for loop 100 times.

#pragma unroll is a compiler optimization that can, for example, replace a piece of code like 

for ( int i = 0; i < 5; i++ )
    b[i] = i;


with

b[0] = 0;
b[1] = 1;
b[2] = 2;
b[3] = 3;
b[4] = 4;


by putting #pragma unroll directive right before the loop. The good thing about the unrolled version is that it involves less processing load for the processor. In case of for loop version, the processing, in addition to assigning each i to b[i], involves i initialization, evaluating i<5 for 6 times, and incrementing i for 5 times. While in the second case, it only involves filing up b array content (perhaps plus int i=5; if i is used later). Another benefit of loop unrolling is the enhancement of Instruction-Level Parallelism (ILP). In the unrolled version, there would possibly be more operations for the processor to push into processing pipeline without being worried about the for loop condition in every iteration.

Posts like this explain that runtime loop unrolling cannot happen for CUDA. In your case CUDA compiler doesn't have any clues that size is going to be 100 so compile-time loop unrolling will not occur, and so if you force unrolling, you may end up hurting the performance.

If you are sure that the size is 100 for all executions, you can unroll your loop like below:

#pragma unroll
for(int i=0;i


in which SIZE is known in compile time with #define SIZE 100.

I also suggest you to have proper CUDA error checking in your code (explained here).
    
             
                                                        
            

            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                    
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          

                              			
        

        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复