Invalid Argument error when copying data from device to host

后端未结

关注

 4  1311

I am having problems copying data from my device back to the host. My data are arranged in a struct:

typedef struct Array2D {
    double* arr;        
    in


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  长情又很酷        
                
              
                            
                2021-01-14 17:40
              
            
            
                                                                       
This (copying device-allocated memory using cudaMemcpy) is a known limitation in CUDA 4.1.  A fix is in the works and will be released in a future version of the CUDA runtime.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  生来不讨喜        
                
              
                            
                2021-01-14 17:49
              
            
            
                                                                       
The error you are seeing is almost certainly caused by h_output->arr not being a valid device pointer, or by h_output->rows or h_output->cols having incorrect values somehow. You have chosen not to show any code explaining how the contents of the source memory d_output have been set, so it is not possible to say for sure what is the root cause of your problem. 

To illustrate the point, here is a complete, runnable demo showing the posted code in action:

#include <cstdlib>
#include <cstdio>

inline void GPUassert(cudaError_t code, char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code),file,line);
        if (Abort) exit(code);
    }       
}

#define GPUerrchk(ans) { GPUassert((ans), __FILE__, __LINE__); }

typedef float Real;

typedef struct Array2D {
    Real* arr;        
    int rows;       
    int cols;       
} Array2D;

__global__ void kernel(const int m, const int n, Real *lval, Array2D *output)
{
    lval[threadIdx.x] = 1.0f + threadIdx.x;
    if (threadIdx.x == 0) {
        output->arr = lval;
        output->rows = m;
        output->cols = n;
    }
}

int main(void)
{
    const int m=8, n=8, mn=m*n;

    Array2D *d_output;
    Real *d_arr;
    GPUerrchk( cudaMalloc((void **)&d_arr,sizeof(Real)*size_t(mn)) ); 

    GPUerrchk( cudaMalloc((void **)&d_output, sizeof(Array2D)) );
    kernel<<<1,mn>>>(m,n,d_arr,d_output);
    GPUerrchk( cudaPeekAtLastError() );

    // This section of code is the same as the original question
    Array2D *h_output = (Array2D*)malloc(sizeof(Array2D));
    GPUerrchk( cudaMemcpy(h_output, d_output, sizeof(Array2D), cudaMemcpyDeviceToHost) );
    size_t sz = size_t(h_output->rows*h_output->cols)*sizeof(Real);
    Real *h_arr = (Real*)malloc(sz);
    GPUerrchk( cudaMemcpy(h_arr, h_output->arr, sz, cudaMemcpyDeviceToHost) );

    for(int i=0; i<h_output->rows; i++)
        for(int j=0; j<h_output->cols; j++)
            fprintf(stdout,"(%d %d) %f\n", i, j, h_arr[j + i*h_output->rows]);

    return 0;
}


I have had to take a few liberties here, because I only have a compute capability 1.2 device at my disposal, so no device side malloc and no double precision. But the host side API calls which retrieve a valid Array2D structure from device memory and use its contents are effectively the same. Running the program works as expected:

$ nvcc -Xptxas="-v" -arch=sm_12 Array2D.cu 
ptxas info    : Compiling entry function '_Z6kerneliiPfP7Array2D' for 'sm_12'
ptxas info    : Used 2 registers, 16+16 bytes smem

$ cuda-memcheck ./a.out 
========= CUDA-MEMCHECK
(0 0) 1.000000
(0 1) 2.000000
(0 2) 3.000000
(0 3) 4.000000
(0 4) 5.000000
(0 5) 6.000000
(0 6) 7.000000
(0 7) 8.000000
(1 0) 9.000000
(1 1) 10.000000
(1 2) 11.000000
(1 3) 12.000000
(1 4) 13.000000
(1 5) 14.000000
(1 6) 15.000000
(1 7) 16.000000
(2 0) 17.000000
(2 1) 18.000000
(2 2) 19.000000
(2 3) 20.000000
(2 4) 21.000000
(2 5) 22.000000
(2 6) 23.000000
(2 7) 24.000000
(3 0) 25.000000
(3 1) 26.000000
(3 2) 27.000000
(3 3) 28.000000
(3 4) 29.000000
(3 5) 30.000000
(3 6) 31.000000
(3 7) 32.000000
(4 0) 33.000000
(4 1) 34.000000
(4 2) 35.000000
(4 3) 36.000000
(4 4) 37.000000
(4 5) 38.000000
(4 6) 39.000000
(4 7) 40.000000
(5 0) 41.000000
(5 1) 42.000000
(5 2) 43.000000
(5 3) 44.000000
(5 4) 45.000000
(5 5) 46.000000
(5 6) 47.000000
(5 7) 48.000000
(6 0) 49.000000
(6 1) 50.000000
(6 2) 51.000000
(6 3) 52.000000
(6 4) 53.000000
(6 5) 54.000000
(6 6) 55.000000
(6 7) 56.000000
(7 0) 57.000000
(7 1) 58.000000
(7 2) 59.000000
(7 3) 60.000000
(7 4) 61.000000
(7 5) 62.000000
(7 6) 63.000000
(7 7) 64.000000
========= ERROR SUMMARY: 0 errors

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暖寄归人        
                
              
                            
                2021-01-14 17:49
              
            
            
                                                                       
It looks like h_output is allocated with a call to malloc().  In the first call to cudaMemcpy() (line 2), h_output is being used as as a host pointer (which seems right).  In the second call to cudaMemcpy() (line 4), h_output->arr is being used as a device pointer (which does not seem right).  In that 4th line, it looks like you are copying from host memory to host memory.  So, you will probably want to use just a straight memcpy() instead of cudaMemcpy().

At least that is what it looks like from the code you have provided.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  猫巷女王i        
                
              
                            
                2021-01-14 17:58
              
            
            
                                                                       
I tried allocating the pointer Array2D->arr on the host using cudaMalloc instead of allocating it on the device using malloc. After that, the code works as intended. 

It looks very much like the problem described in the thread (http://forums.nvidia.com/index.php?showtopic=222659) on nVidia's forum that Pavan referred to in the comments to the question. 

I think that probably closes the question for now, as the code works fine. However, if anyone has a proposal for a solution which utilizes malloc on the device, feel free to post.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复