I\'ve a program that uses three kernels. In order to get the speedups, I was doing a dummy memory copy to create a context as follows:
__global__ void warmStart(
Each CUDA context has memory allocations that are required to execute a kernel that are not required to be allocated to syncrhonize, allocate memory, or free memory. The initial allocation of the context memory and resizing of these allocations is deferred until a kernel requires these resources. Examples of these allocations include the local memory buffer, device heap, and printf heap.