Interpreting the verbose output of ptxas, part I

别跟我提以往 2020-12-14 18:50

I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o
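For context, the resource-usage report discussed below comes from passing `-v` to ptxas through nvcc. A typical invocation and report look like the following (the kernel name is hypothetical, and the figures are illustrative, matching only the numbers quoted in the answer):

```
$ nvcc -Xptxas -v -c kernel.cu -o kernel.o
ptxas info    : Compiling entry function '_Z8myKernelPfi' for 'sm_20'
ptxas info    : Function properties for _Z8myKernelPfi
    72 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 46 registers, 72 bytes cmem[0], 4 bytes cmem[16]
```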

1 Answer
  • 2020-12-14 19:25
    • Each CUDA thread is using 46 registers? Yes, correct
    • There is no register spilling to local memory? Yes, correct
    • Is 72 bytes the sum-total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct
    • What is the difference between 0 byte spill stores and 0 bytes spill loads?
      • Fair question. The spill-load count can exceed the spill-store count: a spilled value may be stored once, loaded, evicted from its register again (i.e. the register is reused for something else), and then loaded a second time. Update: note also that the spill load/store counts come from static analysis, as described by @njuffa in the comments below.
    • Why is the information for cmem (which I am assuming is constant memory) listed twice with different figures? I am not using any constant memory within the kernel. Does that mean the compiler is, under the hood, telling the GPU to use some constant memory?
      • Constant memory is used for several purposes, including __constant__ variables and kernel arguments, and a different constant "bank" is used for each. The details get fairly involved, but as long as you use less than 64KB for your __constant__ variables and less than 4KB for kernel arguments, you will be fine.
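To make the two implicit uses of constant memory concrete, here is a minimal CUDA sketch (kernel name and array size are hypothetical):

```cuda
// Explicit constant memory: counts against the 64KB __constant__ limit
// and shows up in one cmem bank of the ptxas report.
__constant__ float coeffs[256];

// The kernel arguments (ptr, n) are passed via a separate constant bank,
// even though the kernel body never declares any constant memory itself;
// their combined size must stay under the kernel-argument limit.
__global__ void scale(float *ptr, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] *= coeffs[i % 256];  // read from constant memory
}
```

This is why the ptxas report can show nonzero cmem figures for a kernel whose source contains no __constant__ declarations at all: the arguments alone occupy constant memory.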