I have a CUDA program containing a host function and a device function Execute(). In the host function, I allocate a global memory outp
There are many things wrong with this code. For example, each of the following two lines allocates only 1 byte, which is not enough to hold a single instance of Structure_A (or even a single int):
output_cpu= (Structure_A*)malloc(1);
p_cpu=(int *)malloc(1);
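As a minimal sketch of correctly sized host allocations: the layout of Structure_A is not shown in the question, so the definition below (a single int* member named p) is an assumption inferred from the expression output_cpu[0].p, and the helper name allocate_host_buffers is purely illustrative.

```cuda
#include <cstdio>
#include <cstdlib>

// ASSUMPTION: the real Structure_A is not shown in the question; this layout
// is inferred from the use of output_cpu[0].p as an int pointer.
struct Structure_A {
    int *p;
};

// Hypothetical helper: allocates one Structure_A and one int on the host,
// sizing each request with sizeof instead of a hard-coded 1 byte.
bool allocate_host_buffers(Structure_A **output_cpu, int **p_cpu) {
    *output_cpu = (Structure_A *)malloc(sizeof(Structure_A));
    *p_cpu = (int *)malloc(sizeof(int));
    if (*output_cpu == nullptr || *p_cpu == nullptr) {
        // free(nullptr) is a no-op, so this is safe on partial failure.
        free(*output_cpu);
        free(*p_cpu);
        *output_cpu = nullptr;
        *p_cpu = nullptr;
        return false;
    }
    return true;
}

int main() {
    Structure_A *output_cpu = nullptr;
    int *p_cpu = nullptr;
    if (!allocate_host_buffers(&output_cpu, &p_cpu)) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }
    printf("allocated %zu and %zu bytes\n", sizeof(Structure_A), sizeof(int));
    free(p_cpu);
    free(output_cpu);
    return 0;
}
```

Using sizeof keeps the allocation correct even if members are later added to the structure.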
But the immediate cause of your error is that you are doing a memcpy from a pointer allocated on the device runtime heap (i.e. allocated with malloc or new inside your device code) to a host pointer:
err=cudaMemcpy(p_cpu,output_cpu[0].p,sizeof(int),cudaMemcpyDeviceToHost);
Unfortunately, the host runtime API functions cudaMalloc, cudaFree, and cudaMemcpy are not currently compatible with memory allocated on the device runtime heap, so a pointer returned by in-kernel malloc or new cannot be passed to them.
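One possible workaround, sketched below under stated assumptions, is to have a kernel copy the data out of the device heap into a buffer obtained from cudaMalloc, and then call cudaMemcpy on that buffer instead. The Structure_A layout, the kernel bodies, and the names execute, stage, and fetch_value are all hypothetical stand-ins, since the original Execute() code is not shown.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// ASSUMPTION: layout inferred from output_cpu[0].p in the question.
struct Structure_A {
    int *p;
};

// Hypothetical stand-in for Execute(): allocates on the device runtime heap.
__global__ void execute(Structure_A *out) {
    out[0].p = (int *)malloc(sizeof(int));  // device-heap pointer
    if (out[0].p != nullptr) {
        *out[0].p = 42;
    }
}

// Copies the value from the device heap into a cudaMalloc'd staging buffer,
// then releases the device-heap allocation from device code.
__global__ void stage(Structure_A *out, int *staging) {
    if (out[0].p != nullptr) {
        *staging = *out[0].p;
        free(out[0].p);
        out[0].p = nullptr;
    }
}

// Returns the staged value, or -1 if any step failed.
int fetch_value() {
    Structure_A *output_gpu = nullptr;
    int *staging = nullptr;
    int result = -1;

    if (cudaMalloc(&output_gpu, sizeof(Structure_A)) != cudaSuccess) return -1;
    if (cudaMalloc(&staging, sizeof(int)) != cudaSuccess) {
        cudaFree(output_gpu);
        return -1;
    }
    // Pre-fill the staging buffer with -1 so a failed device malloc is visible.
    cudaMemcpy(staging, &result, sizeof(int), cudaMemcpyHostToDevice);

    execute<<<1, 1>>>(output_gpu);
    stage<<<1, 1>>>(output_gpu, staging);

    // The source pointer here comes from cudaMalloc, not the device heap.
    cudaError_t err = cudaMemcpy(&result, staging, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
        result = -1;
    }

    cudaFree(staging);
    cudaFree(output_gpu);
    return result;
}

int main() {
    printf("value = %d\n", fetch_value());
    return 0;
}
```

If the size of the data is known before the kernel launches, a simpler alternative is to skip in-kernel malloc entirely, allocate the buffer with cudaMalloc on the host, and pass that pointer to the kernel.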