I have a Tesla C2070 that is supposed to have 5636554752 bytes of memory.
However, this gives me an error:
int *buf_d = NULL;
err = cudaMalloc((void
The basic problem is in your question title: you don't actually know that you have sufficient memory, you are assuming you do. The runtime API includes the cudaMemGetInfo function, which returns how much free memory there is on the device. When a context is established on a device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, stack for each thread, and heap for in-kernel malloc/new calls (see this answer for further details). All of this can consume rather a lot of memory, leaving your code with much less than the maximum available memory after ECC reservations that you are assuming it has. The API also includes cudaDeviceGetLimit, which you can use to query the amounts of memory that on-device runtime support is consuming. There is also a companion call, cudaDeviceSetLimit, which allows you to change the amount of memory each component of runtime support will reserve.
Even after you have tuned the runtime memory footprint to your tastes and have the actual free memory value from the driver, there are still page size granularity and fragmentation considerations to contend with. It is rarely possible to allocate every byte of what the API reports as free. Usually, I would do something like this when the objective is to try to allocate every available byte on the card:
const size_t Mb = 1<<20; // Assuming a 1Mb page size here
size_t available, total;
cudaMemGetInfo(&available, &total);
int *buf_d = 0;
// Start from what the driver reports as free rather than the card total
size_t nwords = available / sizeof(int);
size_t words_per_Mb = Mb / sizeof(int);
while(cudaMalloc((void**)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation)
{
    // Check before subtracting so the unsigned count cannot wrap around
    if( nwords < 2 * words_per_Mb)
    {
        // signal no free memory
        nwords = 0;
        break;
    }
    nwords -= words_per_Mb; // back off by one page and retry
}
// leaves int buf_d[nwords] on the device or signals no free memory (nwords == 0)
(Note: this code has never been near a compiler, and is only safe on CUDA 3 or later.) It is implicitly assumed that none of the obvious sources of problems with big allocations apply here (32-bit host operating system, WDDM Windows platform without TCC mode enabled, older known driver issues).
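If you want to check the back-off logic of the retry loop above without a GPU at hand, one option is to pull it out into a small host-only helper and pass the allocator in as a callable. The following is only a sketch under that assumption: `allocate_with_backoff`, `BackoffResult` and the `try_alloc` parameter are hypothetical names introduced for illustration, not part of the CUDA API, and the callable is assumed to return a null pointer when the request cannot be satisfied.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Outcome of a back-off allocation attempt (hypothetical type, not a CUDA one).
struct BackoffResult {
    void*       ptr;    // nullptr when nothing could be allocated
    std::size_t bytes;  // size of the successful allocation, 0 on failure
};

// Try to allocate start_bytes; on failure retry with step_bytes less each time.
// Gives up once the request would drop below a single step.
// try_alloc must return nullptr on out-of-memory; on a real device it would
// be a thin wrapper around cudaMalloc (that wrapper is not shown here).
inline BackoffResult allocate_with_backoff(
    std::size_t start_bytes,
    std::size_t step_bytes,
    const std::function<void*(std::size_t)>& try_alloc)
{
    assert(step_bytes > 0 && "step size must be non-zero");

    std::size_t bytes = start_bytes;
    while (bytes >= step_bytes) {
        if (void* p = try_alloc(bytes)) {
            return {p, bytes};
        }
        bytes -= step_bytes;  // cannot wrap: bytes >= step_bytes at this point
    }
    return {nullptr, 0};      // signal no free memory
}
```

On a device you would presumably pass the free byte count from cudaMemGetInfo as `start_bytes`, 1 MiB (or whatever page size you assume) as `step_bytes`, and a callable that calls cudaMalloc and returns a null pointer when it reports cudaErrorMemoryAllocation. Testing the loop condition before subtracting keeps the unsigned size from wrapping around when very little memory is free.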