I have something very similar to the code:
int k, no_streams = 4;
cudaStream_t stream[no_streams];
for(k = 0; k < no_streams; k++) cudaStreamCreate(&strea
According to this post on the NVIDIA forums, the profiler serializes stream execution in order to collect accurate timing data, so work that would normally overlap across streams can appear to run back to back. If you think your timings are off, make sure you're using CUDA events...
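For what it's worth, here is a minimal sketch of timing multi-stream work with CUDA events instead of relying on the profiler. The kernel `scale`, the helper `time_streamed_kernels`, the `CHECK` macro, and the `MAX_STREAMS` limit are all hypothetical names made up for illustration, not anything from your code. It also assumes the default (legacy) stream behaviour, where work recorded in stream 0 is ordered against streams created with `cudaStreamCreate`; that does not hold if you compile with `--default-stream per-thread`.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_STREAMS 16  // hypothetical upper bound for this sketch

// Return -1.0f from the enclosing function on any CUDA error.
// Sketch only: resources allocated so far are not released on this path.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            return -1.0f;                                        \
        }                                                        \
    } while (0)

// Hypothetical kernel whose only purpose is to give each stream some work.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches one kernel per stream and returns the elapsed GPU time in
// milliseconds as measured by CUDA events, or -1.0f on error.
float time_streamed_kernels(int no_streams, int n)
{
    if (no_streams < 1 || no_streams > MAX_STREAMS || n < 1) return -1.0f;

    cudaStream_t stream[MAX_STREAMS];
    float *d_buf[MAX_STREAMS];
    cudaEvent_t start, stop;
    float ms = 0.0f;
    int k;

    for (k = 0; k < no_streams; k++) {
        CHECK(cudaStreamCreate(&stream[k]));
        CHECK(cudaMalloc((void **)&d_buf[k], n * sizeof(float)));
        CHECK(cudaMemset(d_buf[k], 0, n * sizeof(float)));
    }
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Recorded in stream 0: with legacy default-stream semantics this
    // event precedes all work issued to the other streams afterwards.
    CHECK(cudaEventRecord(start, 0));

    for (k = 0; k < no_streams; k++) {
        scale<<<blocks, threads, 0, stream[k]>>>(d_buf[k], n, 2.0f);
        CHECK(cudaGetLastError());
    }

    // Also in stream 0: this event completes only after the work issued
    // above to the other (blocking) streams has finished.
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    for (k = 0; k < no_streams; k++) {
        CHECK(cudaFree(d_buf[k]));
        CHECK(cudaStreamDestroy(stream[k]));
    }
    return ms;
}
```

Calling something like `time_streamed_kernels(4, 1 << 20)` from `main` and comparing the result against the same call with one stream should give a rough idea of whether the streams actually overlap outside the profiler. The event timestamps are taken on the GPU, so they don't include host-side launch overhead the way a CPU timer would.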
I've been experimenting with streaming lately, and I found the "simpleMultiCopy" example from the SDK to be really helpful, particularly for working out the appropriate logic and synchronization.