I have allocated memory on the device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from the host before the kernel finishes its execution?
When you launch the kernel, the launch is an asynchronous (non-blocking) call; calling cudaMemcpy next will block until the kernel has finished.
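A minimal, self-contained sketch of that ordering (the kernel name and sizes here are made up for illustration):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel: fills an array with its thread index.
    __global__ void myKernel(int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = i;
    }

    int main() {
        const int n = 1 << 20;
        int *d_data = nullptr;
        int *h_data = (int *)malloc(n * sizeof(int));
        cudaMalloc(&d_data, n * sizeof(int));

        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);   // returns immediately; kernel runs asynchronously
        // ... host code placed here runs while the kernel may still be executing ...
        cudaMemcpy(h_data, d_data, n * sizeof(int),
                   cudaMemcpyDeviceToHost);              // blocks until the kernel (same stream) has finished

        printf("h_data[42] = %d\n", h_data[42]);
        cudaFree(d_data);
        free(h_data);
        return 0;
    }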
If you want the result for debugging purposes, you may be able to use a CUDA debugger (e.g. cuda-gdb or Nsight), which lets you step through the kernel and inspect the memory.
For small result checks you could also use printf() in the kernel code.
Or run only a single thread block of size (1,1) if you are interested in that specific result.
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize the copy with the kernel's progress partway through its execution, that code would be subject to a race condition: the copy engine might pull from memory that hasn't yet been written by the kernel.
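Here is a rough sketch of that two-stream idiom (the kernel and sizes are made up); it compiles and runs, but as described above, the contents of the copied buffer are not well defined:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void longRunningKernel(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = 0.0f;
            for (int k = 0; k < 10000; ++k) v += sinf(i * 0.001f + k);  // keep the kernel busy for a while
            out[i] = v;
        }
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_out = nullptr, *d_out = nullptr;
        cudaHostAlloc(&h_out, bytes, cudaHostAllocDefault);  // pinned host memory, required for a truly async copy
        cudaMalloc(&d_out, bytes);

        cudaStream_t kernelStream, copyStream;
        cudaStreamCreate(&kernelStream);
        cudaStreamCreate(&copyStream);

        longRunningKernel<<<(n + 255) / 256, 256, 0, kernelStream>>>(d_out, n);

        // Issued in a different stream, so the copy engine may start pulling data
        // while the kernel is still running -- this is the race condition described above:
        // h_out may contain values the kernel has not written yet.
        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, copyStream);

        cudaDeviceSynchronize();
        printf("h_out[0] = %f (contents are not guaranteed to be final values)\n", h_out[0]);

        cudaStreamDestroy(kernelStream);
        cudaStreamDestroy(copyStream);
        cudaFreeHost(h_out);
        cudaFree(d_out);
        return 0;
    }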
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, the answer depends on the compute capability of your device, but all but the very oldest chips can perform data transfers while a kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel, though; you would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device, so that you can copy one of the buffers while the GPU is working on the other, roughly as in the sketch below.
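A sketch of such a double-buffering scheme, assuming a hypothetical produceBatch kernel and two streams (names and sizes are illustrative):

    #include <cuda_runtime.h>

    // Hypothetical kernel that produces one batch of results.
    __global__ void produceBatch(float *out, int n, int batch) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = batch * 1000.0f + i;
    }

    int main() {
        const int n = 1 << 20, numBatches = 8;
        size_t bytes = n * sizeof(float);

        float *d_buf[2], *h_buf[2];
        cudaStream_t stream[2];
        for (int s = 0; s < 2; ++s) {
            cudaMalloc(&d_buf[s], bytes);
            cudaHostAlloc(&h_buf[s], bytes, cudaHostAllocDefault);  // pinned, so async copies can overlap
            cudaStreamCreate(&stream[s]);
        }

        for (int b = 0; b < numBatches; ++b) {
            int s = b % 2;                      // ping-pong between the two buffers/streams
            cudaStreamSynchronize(stream[s]);   // make sure the earlier copy out of this buffer has finished
            produceBatch<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n, b);
            // While this copy runs, the next iteration's kernel (other stream/buffer) can execute.
            cudaMemcpyAsync(h_buf[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
            // ... consume h_buf[s] after cudaStreamSynchronize(stream[s]) when the data is needed ...
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < 2; ++s) {
            cudaFree(d_buf[s]);
            cudaFreeHost(h_buf[s]);
            cudaStreamDestroy(stream[s]);
        }
        return 0;
    }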
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
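A minimal sketch of that polling idea (the kernel, the sizes, and the NaN-sentinel scheme are illustrative assumptions, not anything from the guide):

    #include <cstdio>
    #include <cmath>
    #include <cuda_runtime.h>

    // Hypothetical kernel: writes results directly into mapped (zero-copy) host memory.
    __global__ void fillMapped(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = 0.0f;
            for (int k = 0; k < 5000; ++k) v += cosf(i * 0.01f + k);  // make the kernel take a while
            out[i] = v;
            __threadfence_system();  // help make the write visible to the host
        }
    }

    int main() {
        cudaSetDeviceFlags(cudaDeviceMapHost);  // must be set before the context is created

        const int n = 1 << 18;
        float *h_out = nullptr, *d_out = nullptr;
        cudaHostAlloc(&h_out, n * sizeof(float), cudaHostAllocMapped);
        cudaHostGetDevicePointer(&d_out, h_out, 0);

        for (int i = 0; i < n; ++i) h_out[i] = NAN;  // sentinel meaning "not written yet"

        fillMapped<<<(n + 255) / 256, 256>>>(d_out, n);

        // Repeatedly scan the mapped buffer while the kernel runs,
        // counting values that have appeared so far.
        while (cudaStreamQuery(0) == cudaErrorNotReady) {
            int seen = 0;
            for (int i = 0; i < n; ++i)
                if (!std::isnan(((volatile float *)h_out)[i])) ++seen;
            printf("values written so far: %d of %d\n", seen, n);
        }
        cudaDeviceSynchronize();

        cudaFreeHost(h_out);
        return 0;
    }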
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and an asynchronous device-to-host copy (cudaMemcpyAsync) in each stream. For example, put the kernel that fills location "0" and the copy from that location back to the host into stream 0, the kernel that fills location "1" and the copy from "1" into stream 1, and so on. What will happen then is that the GPU will overlap copying from "0" with executing "1". Check the CUDA documentation; it's documented somewhere (in the best practices guide, I think).
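A rough sketch of that chunked, multi-stream pattern (the fillChunk kernel, chunk count, and sizes are assumptions for illustration):

    #include <cuda_runtime.h>

    // Hypothetical kernel that fills one chunk of the output.
    __global__ void fillChunk(float *out, int chunkSize) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < chunkSize) out[i] = sqrtf((float)i);
    }

    int main() {
        const int numChunks = 4, chunkSize = 1 << 20;
        size_t chunkBytes = chunkSize * sizeof(float);

        float *d_data, *h_data;
        cudaMalloc(&d_data, numChunks * chunkBytes);
        cudaHostAlloc(&h_data, numChunks * chunkBytes, cudaHostAllocDefault);  // pinned host memory

        cudaStream_t streams[numChunks];
        for (int c = 0; c < numChunks; ++c) cudaStreamCreate(&streams[c]);

        // Each stream gets a kernel followed by the copy of the chunk that kernel produced.
        // The copy of chunk 0 can overlap with the kernel for chunk 1, and so on.
        for (int c = 0; c < numChunks; ++c) {
            float *d_chunk = d_data + c * chunkSize;
            float *h_chunk = h_data + c * chunkSize;
            fillChunk<<<(chunkSize + 255) / 256, 256, 0, streams[c]>>>(d_chunk, chunkSize);
            cudaMemcpyAsync(h_chunk, d_chunk, chunkBytes, cudaMemcpyDeviceToHost, streams[c]);
        }

        cudaDeviceSynchronize();

        for (int c = 0; c < numChunks; ++c) cudaStreamDestroy(streams[c]);
        cudaFreeHost(h_data);
        cudaFree(d_data);
        return 0;
    }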