*Modified* Nvidia Maxwell, increased global memory instruction count

Submitted by 二次信任 on 2019-12-19 04:56:45

Question:


I ran an experiment on both a GTX760 (Kepler) and a GTX750Ti (Maxwell) using benchmarks (Parboil, Rodinia), then analyzed the results with the Nvidia Visual Profiler. In most of the applications, the number of global instructions is enormously higher, up to 7-10 times, on the Maxwell architecture.

Specs for both graphics cards:

GTX760      6.0 Gbps    2048 MB    256-bit    192.2 GB/s

GTX750Ti    5.4 Gbps    2048 MB    128-bit    86.4 GB/s

Ubuntu 14.04

CUDA driver 340.29

toolkit 6.5

I compiled the benchmark applications (no modifications) and collected the results from NVVP (6.5). Under Analyze All > Kernel Memory > From L1/Shared Memory, I collected the global load transaction counts.

I attached screenshots of our profiling results for histo run on Kepler (link) and Maxwell (link).

Does anyone know why the global instruction count increases on the Maxwell architecture?

Thank you.


Answer 1:


The counter gld_transactions is not comparable between the Kepler and Maxwell architectures, and it is not equivalent to the count of global instructions executed.

On Fermi/Kepler this counts the number of 128-byte SM-to-L1 requests. It can increment by 0-32 per global/generic instruction executed.

On Maxwell, all global operations go through the TEX (unified) cache, which is completely different from the Fermi/Kepler L1 cache. On this architecture, global transactions count the number of 32-byte sectors accessed in the cache. This can also increment by 0-32 per global/generic instruction executed.

If we look at 4 different cases:

CASE 1: Each thread in a warp accesses the same 32-bit offset.

CASE 2: Each thread in a warp accesses a 32-bit offset with a 128 byte stride.

CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.

CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128 byte memory range that is 128-byte aligned.

gld_transactions for each of the cases listed above, by architecture:

            Kepler      Maxwell
Case 1      1           4
Case 2      32          32
Case 3      1           8
Case 4      1           4-16
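The Kepler column of the table can be sketched with a small hand-written model (not NVVP's actual counter logic) that counts the unique 128-byte L1 cache lines touched by a warp of 32 four-byte accesses. Note that the Maxwell column cannot be reproduced by simply counting unique 32-byte sectors: that count is only a lower bound (it gives 1, 32, and 4 for cases 1-3), because the reported transaction count also depends on how the TEX unit splits the warp when it issues the access.

```python
def kepler_gld_transactions(addresses, line_bytes=128):
    """Number of distinct 128-byte cache lines touched by the warp
    (the Fermi/Kepler SM-to-L1 request granularity)."""
    return len({addr // line_bytes for addr in addresses})

def unique_sectors(addresses, sector_bytes=32):
    """Number of distinct 32-byte sectors touched by the warp: a lower
    bound on the Maxwell transaction count, not the exact value."""
    return len({addr // sector_bytes for addr in addresses})

LANES = range(32)
case1 = [0] * 32                        # CASE 1: every lane, same 32-bit offset
case2 = [lane * 128 for lane in LANES]  # CASE 2: 128-byte stride per lane
case3 = [lane * 4 for lane in LANES]    # CASE 3: consecutive lane-indexed offsets

for name, addrs in (("case1", case1), ("case2", case2), ("case3", case3)):
    print(name, kepler_gld_transactions(addrs), unique_sectors(addrs))
# Kepler model: 1, 32, 1 -- matching the Kepler column above.
```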

My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.

I would recommend looking at l2_{read, write}_{transactions, throughput}.



Source: https://stackoverflow.com/questions/29117708/modified-nvidia-maxwell-increased-global-memory-instruction-count
