Question
I ran an experiment on both a GTX 760 (Kepler) and a GTX 750 Ti (Maxwell) using the Parboil and Rodinia benchmarks, then analyzed the results with the NVIDIA Visual Profiler. In most of the applications, the number of global instructions increased enormously on the Maxwell architecture, by up to 7-10 times.
Specs for both graphics cards:
Card        Memory data rate   Memory size   Bus width   Memory bandwidth
GTX 760     6.0 Gbps           2048 MB       256-bit     192.2 GB/s
GTX 750 Ti  5.4 Gbps           2048 MB       128-bit     86.4 GB/s
Ubuntu 14.04
CUDA driver 340.29
toolkit 6.5
I compiled the benchmark applications without modification and collected the results from NVVP (6.5). Under Analyze All > Kernel Memory > From L1/Shared Memory, I collected the global load transaction counts.
I have attached screenshots of our results for histo run on Kepler (link) and Maxwell (link).
Does anyone know why the global instruction count increases on the Maxwell architecture?
Thank you.
Answer 1:
The counter gld_transactions is not comparable between the Kepler and Maxwell architectures. Furthermore, it is not equivalent to the count of global instructions executed.
On Fermi/Kepler it counts the number of SM-to-L1 128-byte requests. It can increment by 0 to 32 per global/generic instruction executed.
On Maxwell, global operations all go through the TEX (unified) cache. The TEX cache is completely different from the Fermi/Kepler L1 cache. Here, global transactions measure the number of 32-byte sectors accessed in the cache. This count can also increment by 0 to 32 per global/generic instruction executed.
If we look at 4 different cases:
CASE 1: Each thread in a warp accesses the same 32-bit offset.
CASE 2: Each thread in a warp accesses a 32-bit offset with a 128-byte stride.
CASE 3: Each thread in a warp accesses a unique 32-bit offset based upon its lane index.
CASE 4: Each thread in a warp accesses a unique 32-bit offset in a 128-byte memory range that is 128-byte aligned.
gld_transactions for each listed case, by architecture:
          Kepler    Maxwell
Case 1    1         4
Case 2    32        32
Case 3    1         8
Case 4    1         4-16
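To see these counts on your own hardware, the four cases can be turned into one tiny kernel each and profiled separately. The following is only a sketch under some assumptions: the helper and kernel names are made up for illustration, each launch uses a single warp so that one global load instruction is executed per kernel, and Case 4 is interpreted as an arbitrary permutation of the 32 words in one aligned 128-byte line. The compiler and architecture may still affect what the profiler reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define WARP_SIZE 32
#define WORDS_PER_LINE 32  // 128 bytes / 4-byte float

// Element index (in 4-byte words) that a lane reads in each case.
// These helper names are hypothetical, chosen for this sketch.
__host__ __device__ inline int case1_index(int lane) { (void)lane; return 0; }
__host__ __device__ inline int case2_index(int lane) { return lane * WORDS_PER_LINE; }
__host__ __device__ inline int case3_index(int lane) { return lane; }
// 5 is coprime to 32, so this permutes the 32 words of one 128-byte line
// (an assumed reading of Case 4).
__host__ __device__ inline int case4_index(int lane) { return (lane * 5) % WORDS_PER_LINE; }

// One kernel per case so the profiler lists each access pattern by name.
__global__ void case1_kernel(const float* in, float* out) {
    out[threadIdx.x] = in[case1_index(threadIdx.x % WARP_SIZE)];
}
__global__ void case2_kernel(const float* in, float* out) {
    out[threadIdx.x] = in[case2_index(threadIdx.x % WARP_SIZE)];
}
__global__ void case3_kernel(const float* in, float* out) {
    out[threadIdx.x] = in[case3_index(threadIdx.x % WARP_SIZE)];
}
__global__ void case4_kernel(const float* in, float* out) {
    out[threadIdx.x] = in[case4_index(threadIdx.x % WARP_SIZE)];
}

int main() {
    const size_t n = WARP_SIZE * WORDS_PER_LINE;  // large enough for the 128-byte stride case
    float* in = NULL;
    float* out = NULL;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, WARP_SIZE * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, n * sizeof(float));

    // A single warp per launch: one block of 32 threads.
    case1_kernel<<<1, WARP_SIZE>>>(in, out);
    case2_kernel<<<1, WARP_SIZE>>>(in, out);
    case3_kernel<<<1, WARP_SIZE>>>(in, out);
    case4_kernel<<<1, WARP_SIZE>>>(in, out);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("all four access-pattern kernels completed\n");

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Assuming the file is built with nvcc into a binary called cases (a hypothetical name), something like `nvprof --metrics gld_transactions,l2_read_transactions ./cases` should report the counters per kernel, which makes it easy to compare the two metrics side by side on each card.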
My recommendation is to avoid looking at gld_transactions. A future version of the CUDA profilers should use different metrics that are more actionable and comparable to past architectures.
I would recommend looking at l2_{read, write}_{transactions, throughput}.
Source: https://stackoverflow.com/questions/29117708/modified-nvidia-maxwell-increased-global-memory-instruction-count