CUDA kernel - nested for loop

后端未结

关注

 3  1255

Hello I\'m trying to write a CUDA kernel to perform the following piece of code.

for (n = 0; n < (total-1); n++)
{
  a = values[n];

  for ( i = n+1; i &


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  -上瘾入骨i        
                
              
                            
                2020-12-08 09:41
              
            
            
                                                                       
I'll probably be way wrong but the n < (total-1) check in

for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)


seems different than the original version.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2020-12-08 09:47
              
            
            
                                                                       
Realize this problem in 2D and launch your kernel with 2D thread blocks. The total number of threads in x and y dimension will be equal to total . The kernel code should look like this:

__global__ void calc(float *values, float *newvalues, int total){


float a,b,c;

int n= blockIdy.y * blockDim.y + threadIdx.y;
int i= blockIdx.x * blockDim.x + threadIdx.x;

  if (n>=total || i>=total)
        return;

a = values[n];
b = values[i] - a;
c = b*b;
 if( c < 10)
        newvalues[i] = c;  

// I don't know your problem statement but i think it should be like: newvalues[n*total+i] = c;  


}


Update:

This is how you should call the kernel

dim3 block(16,16);
dim3 grid (  (total+15)/16,  (total+15)/16  );
calc<<<grid,block>>>(float *val, float *newval, int T);


Also make sure you add this line in kernel (see updated kernel)

if (n>=total || i>=total)
return;

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2020-12-08 09:51
              
            
            
                                                                       
Why don't you just remove the outter loop and start the kernel with as many threads as you need for this loop? It's a bit weird to have a loop that depends on your blockId. Normally you try to avoid these loops.
Secondly it seems to me that newvalues[i] can be overriden by different threads.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复