Allocating unified memory in my program throws "CUDA Error: out of memory" at run time, even though free memory remains


Before asking this, I read this question, which is similar to mine.

Here I will provide my program in detail.

#define N 70000
#define M 1000

c


                      
1 Answer

鱼传尺愫, 2021-01-14 07:03
              
            
            
                                                                       
If I modify your code with some instrumentation, like this:

#include <cstdio>
#include <cstdlib>   // exit
#include <iostream>

#define N 70000
#define M 1000

class ObjBox
{
    public:

        int oid; 
        float x; 
        float y; 
        float ts;
};

class Bucket
{
    public:

        int bid; 
        int nxt; 
        ObjBox *arr_obj; 
        int nO;
};

int main()
{

    Bucket *arr_bkt;
    cudaMallocManaged(&arr_bkt, N * sizeof(Bucket));

    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i; 
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;

        size_t allocsz = size_t(M) * sizeof(ObjBox);
        cudaError_t r = cudaMallocManaged(&(arr_bkt[i].arr_obj), allocsz);
        if (r != cudaSuccess) {
            printf("CUDA Error on %s\n", cudaGetErrorString(r));
            exit(1);  // non-zero status so callers can see the failure
        } else {
            size_t total_mem, free_mem;
            cudaMemGetInfo(&free_mem, &total_mem);
            std::cout << i << ":Allocated " << allocsz;
            std::cout << " Currently " << free_mem << " bytes free" << std::endl;
        } 

        for (int j = 0; j < M; j++) {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    std::cout << "Bucket Array Initial Completed..." << std::endl;
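    // Release each per-bucket array before freeing the bucket array itself.
    for (int i = 0; i < N; i++) {
        cudaFree(arr_bkt[i].arr_obj);
    }
    cudaFree(arr_bkt);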

    return 0;
}


and compile and run it on a unified memory system with 16 GB of physical host memory and 2 GB of physical device memory, using the Linux 352.39 driver, I get this:

0:Allocated 16000 Currently 2099871744 bytes free
1:Allocated 16000 Currently 2099871744 bytes free
2:Allocated 16000 Currently 2099871744 bytes free
3:Allocated 16000 Currently 2099871744 bytes free
4:Allocated 16000 Currently 2099871744 bytes free
5:Allocated 16000 Currently 2099871744 bytes free
6:Allocated 16000 Currently 2099871744 bytes free
7:Allocated 16000 Currently 2099871744 bytes free
8:Allocated 16000 Currently 2099871744 bytes free
9:Allocated 16000 Currently 2099871744 bytes free
....
....
....
65445:Allocated 16000 Currently 1028161536 bytes free
65446:Allocated 16000 Currently 1028161536 bytes free
65447:Allocated 16000 Currently 1028161536 bytes free
65448:Allocated 16000 Currently 1028161536 bytes free
65449:Allocated 16000 Currently 1028161536 bytes free
65450:Allocated 16000 Currently 1028161536 bytes free
65451:Allocated 16000 Currently 1028161536 bytes free
CUDA Error on out of memory    


i.e. it reports out of memory while 1028161536 bytes (roughly 1 GB) are still reported as free on the device.

I think the key to understanding this is the number of allocations at the failure point, rather than their size. 65451 is suspiciously close to 65535 (i.e. 2^16 - 1). Allowing for the internal memory allocations that the runtime makes, I am going to guess that there is some sort of accidental or deliberate limit of 65535 on the total number of managed memory allocations.
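One way to test that guess is to repeat the experiment with a much smaller allocation size and see whether the failure moves. The following is only a sketch: `probe_managed_allocs` and the cap of 100000 are hypothetical names and values chosen for illustration, and the counts it prints will depend on the driver, GPU and CUDA version (newer systems may never hit a failure below the cap).

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): make up to max_count
// managed allocations of `bytes` bytes each and return how many succeeded.
size_t probe_managed_allocs(size_t max_count, size_t bytes)
{
    std::vector<void *> ptrs;
    ptrs.reserve(max_count);

    for (size_t i = 0; i < max_count; i++) {
        void *p = nullptr;
        cudaError_t r = cudaMallocManaged(&p, bytes);
        if (r != cudaSuccess) {
            printf("Allocation %zu of %zu bytes failed: %s\n",
                   i, bytes, cudaGetErrorString(r));
            cudaGetLastError();  // reset the recorded error before the next probe
            break;
        }
        ptrs.push_back(p);
    }

    // Release everything so the next probe starts from a clean state.
    for (void *p : ptrs) {
        cudaFree(p);
    }
    return ptrs.size();
}

int main()
{
    const size_t cap = 100000;           // arbitrary bound, chosen to exceed 65535
    const size_t sizes[] = {16, 16000};  // tiny vs. the 16000 bytes used above

    for (size_t bytes : sizes) {
        size_t n = probe_managed_allocs(cap, bytes);
        printf("%zu-byte allocations: %zu succeeded (cap %zu)\n", bytes, n, cap);
    }
    return 0;
}
```

If both sizes stop at about the same count, that would point to a limit on the number of allocations rather than on the number of bytes. If the small allocations get much further, the guess above is probably wrong.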

I would be very interested to see whether you can reproduce this. If you can, I would consider filing a bug report with NVIDIA.
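In the meantime, if the number of allocations really is what matters, a possible workaround is to make one large managed allocation and hand each bucket a slice of it, so the program makes two managed allocations instead of 70001. This is a minimal sketch under that assumption, not a confirmed fix: `check` and `pool_offset` are hypothetical helpers, and the single pool needs N * M * sizeof(ObjBox) bytes (about 1.12 GB), which may or may not fit on a given device.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 70000
#define M 1000

struct ObjBox { int oid; float x; float y; float ts; };
struct Bucket { int bid; int nxt; ObjBox *arr_obj; int nO; };

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
static void check(cudaError_t r, const char *what)
{
    if (r != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(r));
        exit(1);
    }
}

// Index of the first ObjBox that belongs to bucket i inside the shared pool.
static size_t pool_offset(size_t i, size_t per_bucket)
{
    return i * per_bucket;
}

int main()
{
    Bucket *arr_bkt = nullptr;
    ObjBox *pool = nullptr;

    // Two managed allocations in total, regardless of N.
    check(cudaMallocManaged(&arr_bkt, size_t(N) * sizeof(Bucket)), "bucket array");
    check(cudaMallocManaged(&pool, size_t(N) * M * sizeof(ObjBox)), "object pool");

    for (int i = 0; i < N; i++) {
        arr_bkt[i].bid = i;
        arr_bkt[i].nxt = -1;
        arr_bkt[i].nO = 0;
        arr_bkt[i].arr_obj = pool + pool_offset(i, M);  // slice, not a new allocation

        for (int j = 0; j < M; j++) {
            arr_bkt[i].arr_obj[j].oid = -1;
            arr_bkt[i].arr_obj[j].x = -1;
            arr_bkt[i].arr_obj[j].y = -1;
            arr_bkt[i].arr_obj[j].ts = -1;
        }
    }

    printf("Bucket array initialised from a single pool\n");

    cudaFree(pool);
    cudaFree(arr_bkt);
    return 0;
}
```

The trade-off is that individual buckets can no longer be freed or resized independently, since they all share one allocation.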
                                                                        
                                                        
            
            
              
                