I have some problems running my code on a GTX 480 with Compute Capability 2.0.
I always get the following error if I launch the kernel with 1024 threads per block:
<
I had the same error.
Thanks to http://cuda-programming.blogspot.fr/2013/01/handling-cuda-error-messages.html, I understood the error. It says:
"Too Many Resources Requested for Launch - This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem."
Basically, I used to be able to launch a given number of threads per block (8x8x16 = 1024 for a 3D kernel). But if you nest your kernel calls, you further reduce the number of registers available to each thread, so the same block size may no longer launch.
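
To see where the limit comes from: a Compute Capability 2.0 multiprocessor has 32768 registers, so a block of 1024 threads only fits if the kernel uses at most 32 registers per thread. Here is a minimal sketch of how you could check this at runtime and shrink the block if needed. `myKernel` and `fitBlockToLimit` are hypothetical names I made up for illustration, and halving the z dimension is just one possible way to reduce the block; adapt it to your own kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute your own.
__global__ void myKernel(float *data)
{
    // Linear index of this thread inside its 3D block.
    int idx = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    data[idx] = static_cast<float>(idx);
}

// Hypothetical helper: halve the z dimension until the block has no more
// threads than maxThreads (or z cannot be reduced any further).
dim3 fitBlockToLimit(dim3 block, int maxThreads)
{
    while (block.x * block.y * block.z > static_cast<unsigned>(maxThreads)
           && block.z > 1) {
        block.z /= 2;
    }
    return block;
}

int main()
{
    // Ask the runtime how many registers the compiled kernel uses per thread
    // and how many threads per block it can be launched with on this device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    // Start from the 8x8x16 block and shrink it if the kernel needs too many registers.
    dim3 block = fitBlockToLimit(dim3(8, 8, 16), attr.maxThreadsPerBlock);

    float *d_data = NULL;
    err = cudaMalloc(&d_data, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    myKernel<<<1, block>>>(d_data);

    // A launch that requests too many resources is reported here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    } else {
        err = cudaDeviceSynchronize();
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

If you want to keep 1024 threads per block, you can instead try to cap register usage at compile time, for example with `__launch_bounds__(1024)` on the kernel or the nvcc option `-maxrregcount=32`. The compiler may then spill registers to local memory, which can cost performance.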