I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.
I compiled my kernel.cu file to a kernel.o
__global__ and __device__ functions? Yes, correct.

For __constant__ variables and kernel arguments, different "banks" are used. That starts to get a bit detailed, but as long as you use less than 64 KB for your __constant__ variables and less than 4 KB for kernel arguments you will be OK.
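If you want to cross-check per-thread resource usage from inside a program, one option is to ask the runtime with cudaFuncGetAttributes. The following is only a minimal sketch: the kernel `scale` and the constant table `coeffs` are hypothetical names made up for illustration and are not taken from the kernel in the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical constant table (assumption, for illustration only):
// 256 floats = 1 KB, far below the 64 KB limit for __constant__ variables.
__constant__ float coeffs[256];

// Hypothetical kernel (assumption): exists only so there is something to inspect.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * coeffs[i % 256];
    }
}

int main()
{
    // Ask the runtime for the resource usage recorded for this kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread      : %d\n", attr.numRegs);
    std::printf("local memory per thread   : %zu bytes\n", attr.localSizeBytes);
    std::printf("static shared memory      : %zu bytes\n", attr.sharedSizeBytes);
    std::printf("user constant memory      : %zu bytes\n", attr.constSizeBytes);
    return 0;
}
```

Assuming the file is saved as query.cu, it can be built with `nvcc query.cu -o query`. The exact numbers depend on the target architecture and compiler version, so treat them as a sanity check next to the compiler's verbose output rather than a replacement for it.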