copy from GPU to CPU is slower than copying CPU to GPU

前端未结

关注

 2  1208

I have started learning cuda for a while and I have the following problem

See how I am doing below:

Copy GPU

int* B;
// ...


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  悲哀的现实        
                
              
                            
                2021-01-18 16:45
              
            
            
                                                                       
Instead of using clock() to measure time, you should use events:

With events you would have something like this: 

  cudaEvent_t start, stop;   // variables that holds 2 events 
  float time;                // Variable that will hold the time
  cudaEventCreate(&start);   // creating the event 1
  cudaEventCreate(&stop);    // creating the event 2
  cudaEventRecord(start, 0); // start measuring  the time

  // What you want to measure
  cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
  cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int),cudaMemcpyHostToDevice);

  cudaEventRecord(stop, 0);                  // Stop time measuring
  cudaEventSynchronize(stop);               // Wait until the completion of all device 
                                            // work preceding the most recent call to cudaEventRecord()

  cudaEventElapsedTime(&time, start, stop); // Saving the time measured


EDIT : Additional information :

"The kernel launch returns control to the CPU thread before it is finished. Therefore your timing construct is measuring both the kernel execution time as well as the 2nd memcpy. When timing the copy after the kernel, your timer code is being executed immediately, but the cudaMemcpy is waiting for the kernel to complete before it starts. This also explains why your timing measurement for the data return seems to vary based on kernel loop iterations. It also explains why the time spent on your kernel function is "negligible"". credits to Robert Crovella
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  广开言路        
                
              
                            
                2021-01-18 16:47
              
            
            
                                                                       
As for your second question

 B[ind(tid,1,Nel)]=j// j in most cases do no go all the way to the Nel reach


When performing calculation on the GPU, due to sync reasons, every thread which has finished his job does not perform any calculations until all the thread in the same workgroup have finished. 

In other words, the time you need to perform this calculation will be that of the worst case, it doesn't matter if most of the threads don't go all the way down.

I am not sure about your first question, how do you measure the time? I am not too familiar with cuda, but I think that when copying from CPU to GPU the implementation bufferize your data, hiding the effective time involved.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复