How to calculate the achieved bandwidth of a CUDA kernel


I want a measure of how much of the peak memory bandwidth my kernel achieves.

Say I have an NVIDIA Tesla C1060, which has a maximum bandwidth of 102.4 GB/s. In my kernel


1 Answer

长发绾君心

2020-12-15 15:34
- You do not really have 1,000,000 threads running at once. You perform ~32 GB of global memory accesses, and the bandwidth is determined by the threads currently running (reading) on the SMs and by the size of the data they read.
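- On devices of compute capability 2.x, all global memory accesses are cached in L1 and L2 unless you tell the compiler to use uncached accesses. Devices of compute capability 1.x, such as the Tesla C1060, do not cache global memory accesses this way.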
- I think so. Achieved bandwidth is related to global memory.
- I recommend using the Visual Profiler to see the global memory read/write bandwidth. It would be interesting if you posted your results :).
The default counters in the Visual Profiler give you enough information to get an idea of how your kernel behaves (memory bandwidth, shared memory bank conflicts, instructions executed...).
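Independently of the profiler, a common back-of-the-envelope estimate is the effective bandwidth: bytes read plus bytes written, divided by the kernel time, then compared against the card's peak. The sketch below assumes you already know how many bytes the kernel moves and have measured its runtime; the function names and the example figures (1,000,000 floats read and written in 0.2 ms) are hypothetical, and only the 102.4 GB/s peak comes from the question.

```python
# Minimal sketch: effective bandwidth as a fraction of peak.
# All names and the example workload are illustrative assumptions.

PEAK_GBS = 102.4  # peak bandwidth of the Tesla C1060, as quoted in the question


def effective_bandwidth_gbs(bytes_read, bytes_written, seconds):
    """Return effective bandwidth in GB/s, with 1 GB taken as 1e9 bytes."""
    return (bytes_read + bytes_written) / 1e9 / seconds


def fraction_of_peak(achieved_gbs, peak_gbs=PEAK_GBS):
    """Return achieved bandwidth as a fraction of the peak (0.0 to 1.0)."""
    return achieved_gbs / peak_gbs


if __name__ == "__main__":
    n = 1_000_000                 # hypothetical element count
    bytes_per_float = 4
    kernel_time_s = 0.2e-3        # hypothetical measured kernel time
    bw = effective_bandwidth_gbs(n * bytes_per_float,
                                 n * bytes_per_float,
                                 kernel_time_s)
    print(f"effective bandwidth: {bw:.1f} GB/s "
          f"({100 * fraction_of_peak(bw):.1f}% of peak)")
```

This counts only the bytes the algorithm needs, so it can differ from what the hardware actually transfers; the profiler counters measure the latter.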

Regarding your question, to calculate the achieved global memory throughput:

Compute Visual Profiler User Guide, DU-05162-001_v02, October 2010, page 56, Table 7: Supported Derived Statistics.

Global memory read throughput in gigabytes per second.

For compute capability < 2.0 this is calculated as
    (((gld_32*32) + (gld_64*64) + (gld_128*128)) * TPC) / gputime

For compute capability >= 2.0 this is calculated as
    ((DRAM reads) * 32) / gputime
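The two formulas can be sketched as small helper functions if you want to recompute the figure from raw counter values. Two assumptions are not stated in the quoted table: that `gputime` is reported in microseconds (so bytes per microsecond is divided by 1000 to get GB/s) and that `TPC` is the number of texture processing clusters on the device. The function names are hypothetical and the counter values are made up for illustration.

```python
# Minimal sketch of the derived statistic quoted above.
# Assumptions (not from the quoted table): gputime is in microseconds,
# and tpc is the number of texture processing clusters on the device.


def gld_throughput_gbs_cc1x(gld_32, gld_64, gld_128, tpc, gputime_us):
    """Global read throughput in GB/s for compute capability < 2.0."""
    bytes_read = (gld_32 * 32 + gld_64 * 64 + gld_128 * 128) * tpc
    # bytes per microsecond equals MB/s * 1e-0 in units of 1e6 B/s;
    # dividing by 1e3 converts to units of 1e9 B/s (GB/s).
    return bytes_read / gputime_us / 1e3


def gld_throughput_gbs_cc2x(dram_reads, gputime_us):
    """Global read throughput in GB/s for compute capability >= 2.0."""
    bytes_read = dram_reads * 32  # each DRAM read counts as 32 bytes
    return bytes_read / gputime_us / 1e3


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(gld_throughput_gbs_cc1x(gld_32=0, gld_64=0, gld_128=1000,
                                  tpc=10, gputime_us=100.0))
    print(gld_throughput_gbs_cc2x(dram_reads=100_000, gputime_us=100.0))
```

If your profiler version reports `gputime` in a different unit, adjust the final division accordingly.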

Hope this helps.