I have some problems running my code on a GTX 480 with Compute Capability 2.0.
I always get the following error if I launch the kernel with 1024 threads per block:
I had the same error.
Thanks to http://cuda-programming.blogspot.fr/2013/01/handling-cuda-error-messages.html, I understood the error. They say:
"Too Many Resources Requested for Launch - This error means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem."
Basically, I used to be able to launch a given number of threads per block (8x8x16 = 1024 for a 3D kernel). But if you nest your kernel calls, you further reduce the number of registers available, and the same block size no longer fits.
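To put rough numbers on the register limit quoted above, here is a minimal host-side sketch in plain C++ (no GPU needed). It assumes the 32768 registers per multiprocessor and the 1024-thread block cap of Compute Capability 2.0, and it assumes registers are handed out one whole warp at a time. The function name `max_threads_per_block` is made up for illustration. The hardware's real allocation granularity is ignored, so treat the result as an optimistic upper bound rather than an exact limit.

```cpp
// Rough upper bound on threads per block for a given per-thread register count.
// Assumptions (not from the original post): defaults match Compute Capability 2.0,
// and a block must fit entirely within one multiprocessor's register file.
inline int max_threads_per_block(int regs_per_thread,
                                 int regs_per_sm = 32768,
                                 int warp_size = 32,
                                 int hw_max_threads = 1024) {
    if (regs_per_thread <= 0 || regs_per_sm <= 0 || warp_size <= 0) {
        return 0;  // invalid input, nothing can be estimated
    }
    // Registers consumed by one full warp.
    const int regs_per_warp = regs_per_thread * warp_size;
    // Whole warps that fit in the register file (integer division rounds down).
    const int warps = regs_per_sm / regs_per_warp;
    const int threads = warps * warp_size;
    // The hardware cap on block size still applies.
    return threads < hw_max_threads ? threads : hw_max_threads;
}

// max_threads_per_block(32) -> 1024  (8x8x16 still fits)
// max_threads_per_block(40) -> 800   (8x8x16 = 1024 is too large)
// max_threads_per_block(63) -> 512
```

So with 1024 threads per block, each thread can use at most 32 registers under these assumptions. If the kernel needs more than that, a smaller block such as 8x8x8 = 512 is one way out. The actual per-thread register count can be read from the compiler output of `nvcc -Xptxas -v`, or at runtime from the `numRegs` field filled in by `cudaFuncGetAttributes`.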
Have you tried upgrading the GPU driver? For me the program ran fine until I got unlucky and hit the exact same problem. There were no warnings about minimum driver versions whatsoever.