I'm running Windows 7 64-bit, CUDA 4.2, and Visual Studio 2010.
First, I run some code on CUDA, then download the data back to the host. Then I do some processing and move b
I suggest you use CUDPP; in my opinion it is faster than Thrust (I'm writing my master's thesis on optimization and have tried both libraries). If the copy is very slow, you can try writing your own kernel to copy the data.
The problem is one of timing, not of any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also the time for the prior kernel you launched to complete. If you change your code for timing the copy operation to something like this:
cudaDeviceSynchronize(); // wait until prior kernel is finished
start=clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end=clock();
cout<<"Time Spent:"<<end-start<<endl;
You should find the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow", it is "why is my kernel slow". And based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls which serialise kernel memory transactions".
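If you want to see the timing effect without a GPU, here is a rough host-only analogy in plain C++. It is a sketch, not CUDA code: std::async stands in for the asynchronous kernel launch, and fake_kernel, fake_copy and time_copy_ms are hypothetical names made up for this illustration. The "copy" has to wait for the "kernel", so unless you synchronise before starting the timer, the kernel's run time gets billed to the copy.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Hypothetical stand-in for a slow kernel: it only sleeps for kernel_ms.
void fake_kernel(int kernel_ms) {
    std::this_thread::sleep_for(std::chrono::milliseconds(kernel_ms));
}

// Hypothetical stand-in for a blocking device-to-host copy: it cannot
// proceed until the "kernel" has finished, so it waits on it first.
void fake_copy(const std::shared_future<void>& kernel_done) {
    kernel_done.wait();
}

// Launches the fake kernel asynchronously, then times the fake copy.
// With sync_first == true we wait for the kernel BEFORE starting the
// timer, which is the role cudaDeviceSynchronize() plays in real code.
// Returns the measured "copy" time in milliseconds.
double time_copy_ms(int kernel_ms, bool sync_first) {
    typedef std::chrono::steady_clock steady;

    // Returns immediately, like a kernel launch.
    std::shared_future<void> kernel_done =
        std::async(std::launch::async, fake_kernel, kernel_ms).share();

    if (sync_first) {
        kernel_done.wait();  // analogous to cudaDeviceSynchronize()
    }

    steady::time_point start = steady::now();
    fake_copy(kernel_done);
    steady::time_point end = steady::now();

    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Calling time_copy_ms(300, false) should report roughly the full 300 ms of "kernel" time as copy time, while time_copy_ms(300, true) should report close to zero. That is the same pattern as in the question: the copy itself did not get slower, the timer just started too early.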