Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (reference here: Default Pinned Memory Vs Zero-Copy Memory)?
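For concreteness, this is a minimal sketch of the kind of zero-copy setup I mean (the kernel and sizes are hypothetical, just to make the discussion concrete):

```
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel: the loads and stores go directly to mapped host
// memory over PCIe instead of device DRAM.
__global__ void scale(const float* in, float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

int main()
{
    const int n = 1 << 20;

    // Must be set before the CUDA context is created to allow mapped memory.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped (zero-copy) host buffers.
    float *h_in, *h_out;
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Device-visible aliases of the host buffers.
    float *d_in, *d_out;
    cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", h_out[0]);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}
```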
Fermi and Kepler GPUs need to replay memory instructions for multiple reasons: the operation uses a size specifier (vector type) that requires multiple transactions, the addresses accessed by a warp diverge across multiple cache lines, the access misses in the L1 cache, or the load/store unit (LSU) resources are full and the instruction has to be reissued later.

The latency to zero-copy memory over PCIe is far higher than the latency to the L2 cache or to device memory.
The replay overhead increases because the longer latency leads to more outstanding misses and more contention for LSU resources.
The global load efficiency does not increase because it is the ratio of the ideal amount of data that would need to be transferred for the executed memory instructions to the amount of data actually transferred. Ideal means that the executed threads accessed sequential elements in memory starting at a cache-line boundary (for a full warp, a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, and a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or change the amount of data transferred.
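As a worked example of that ratio, here is a small sketch (hypothetical kernels, assuming 128-byte cache lines, 4-byte loads, and 32 threads per warp; the percentages in the comments are the expected gld_efficiency):

```
#include <cuda_runtime.h>

// Coalesced: each warp requests 32 * 4 B = 128 B from a single 128-byte
// cache line, and 128 B are transferred -> efficiency = 128 / 128 = 100%.
__global__ void coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided by 32 elements (128 B): every thread in a warp touches a
// different cache line, so the warp still requests 128 B but
// 32 * 128 B = 4096 B are transferred -> efficiency = 128 / 4096 ~ 3%.
__global__ void strided(const float* in, float* out, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) out[i] = in[i];
}

int main()
{
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    coalesced<<<n / 256, 256>>>(in, out, n);
    strided<<<n / (256 * 32), 256>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```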
The profiler exposes the following counters:

- gld_throughput
- l1_cache_global_hit_rate
- dram_read_throughput and dram_write_throughput
- l2_l1_read_hit_rate

In the zero-copy case all of these metrics should be much lower.
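If command-line collection is easier, something like `nvprof --metrics gld_throughput,gld_efficiency,dram_read_throughput,dram_write_throughput ./app` should report the corresponding values; run `nvprof --query-metrics` first, since the exact metric names differ between Fermi and Kepler.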
The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).