Interpreting the verbose output of ptxas, part I

别跟我提以往

I am trying to understand resource usage for each of my CUDA threads for a hand-written kernel.

I compiled my kernel.cu file to a kernel.o

1 Answer

抹茶落季

2020-12-14 19:25
- Is each CUDA thread using 46 registers? Yes, correct.
- Is there no register spilling to local memory? Yes, correct.
- Is 72 bytes the sum total of the memory for the stack frames of the __global__ and __device__ functions? Yes, correct.
- What is the difference between 0 bytes spill stores and 0 bytes spill loads?
  - Fair question. The loads can exceed the stores: a computed value can be spilled once, loaded, discarded (i.e. the register is overwritten with something else), and then loaded again when it is reused. Update: note also that the spill load/store counts are based on static analysis, as described by @njuffa in the comments below.
- Why is the information for cmem (which I am assuming is constant memory) repeated twice with different figures? Within the kernel I am not using any constant memory. Does that mean the compiler is, under the hood, going to tell the GPU to use some constant memory?
  - Constant memory serves several purposes, including __constant__ variables and kernel arguments, and these live in different "banks", which is why cmem is reported more than once. So yes, a kernel that declares no __constant__ variables still uses constant memory, for example for its arguments. The details get fairly involved, but as long as you use less than 64 KB for your __constant__ variables and less than 4 KB for kernel arguments, you will be fine.
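To cross-check the compile-time report at run time, here is a minimal sketch, not the asker's kernel. The names scaleKernel, Params and coeffs are hypothetical, and the two limits are simply the figures quoted above, so they are assumptions that may differ on newer GPUs and toolkits. It guards both constant-memory budgets with static_assert and then asks the runtime for related per-kernel figures through cudaFuncGetAttributes. The runtime values are not guaranteed to line up one-to-one with the cmem[n] lines that ptxas prints.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Limits as quoted in the answer above (assumed; check your target architecture).
constexpr std::size_t kMaxConstantBytes  = 64 * 1024; // user __constant__ data
constexpr std::size_t kMaxKernelArgBytes = 4 * 1024;  // kernel arguments

// Hypothetical constant table, for illustration only.
__constant__ float coeffs[256];

// Hypothetical argument bundle. It is passed by value, so it counts
// against the kernel-argument budget rather than the __constant__ budget.
struct Params {
    const float* in;
    float*       out;
    int          n;
};

// Fail the build early if either budget is exceeded.
static_assert(sizeof(coeffs) <= kMaxConstantBytes,
              "__constant__ data exceeds the assumed 64 KB limit");
static_assert(sizeof(Params) <= kMaxKernelArgBytes,
              "kernel arguments exceed the assumed 4 KB limit");

// Hypothetical kernel that reads both its arguments and the constant table.
__global__ void scaleKernel(Params p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) {
        p.out[i] = p.in[i] * coeffs[i % 256];
    }
}

// Print the per-kernel resource figures that the runtime exposes.
cudaError_t reportKernelResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return err;
    }
    std::printf("registers per thread          : %d\n",  attr.numRegs);
    std::printf("local memory per thread (B)   : %zu\n", attr.localSizeBytes);
    std::printf("user constant memory (B)      : %zu\n", attr.constSizeBytes);
    std::printf("static shared memory (B)      : %zu\n", attr.sharedSizeBytes);
    return cudaSuccess;
}

int main()
{
    return reportKernelResources() == cudaSuccess ? 0 : 1;
}
```

Building this with nvcc and the verbose ptxas option lets you compare the register count reported at compile time with numRegs at run time. A non-zero localSizeBytes is a hint that the kernel uses per-thread local memory, which is where spilled registers would end up.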