CUDA zero-copy performance

轻奢々 2021-01-23 12:44

Does anyone have experience analyzing the performance of CUDA applications that use the zero-copy memory model (reference: Default Pinned Memory Vs Zero-Copy Memory)?

1 answer
  • 2021-01-23 13:07

    Fermi and Kepler GPUs need to replay memory instructions for multiple reasons:

    1. The memory operation used a size specifier (vector type) that requires multiple transactions to perform the address divergence calculation and to move data to or from the L1 cache.
    2. The memory operation had thread address divergence, requiring access to multiple cache lines.
    3. The memory transaction missed the L1 cache. When the missed value is returned to L1, the L1 notifies the warp scheduler to replay the instruction.
    4. The LSU resources are full, and the instruction needs to be replayed when the resources become available.
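
    Reasons 1 and 2 both come down to how many cache lines one warp-wide instruction touches. The following is a minimal host-side sketch of that count, not the hardware's actual replay logic. It assumes 128-byte cache lines and 32-thread warps (consistent with the cache-line counts quoted later in this answer); `cache_lines_touched` and `strided_warp` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Assumption: 128-byte cache lines (32 threads x 4 bytes = 1 line).
constexpr std::uint64_t kCacheLineBytes = 128;

// Counts the distinct cache lines touched by one warp-wide memory instruction.
// addrs holds one byte address per active thread; access_bytes is the
// per-thread access size (4, 8, or 16 for 32-, 64-, or 128-bit operations).
std::size_t cache_lines_touched(const std::vector<std::uint64_t>& addrs,
                                std::uint64_t access_bytes) {
    assert(access_bytes > 0);
    std::set<std::uint64_t> lines;
    for (std::uint64_t a : addrs) {
        const std::uint64_t first = a / kCacheLineBytes;
        const std::uint64_t last = (a + access_bytes - 1) / kCacheLineBytes;
        // A misaligned access can straddle a line boundary, so record
        // every line between the first and last byte it covers.
        for (std::uint64_t line = first; line <= last; ++line) {
            lines.insert(line);
        }
    }
    return lines.size();
}

// Hypothetical helper: addresses for a 32-thread warp in which thread i
// accesses base + i * stride_bytes.
std::vector<std::uint64_t> strided_warp(std::uint64_t base,
                                        std::uint64_t stride_bytes) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t i = 0; i < 32; ++i) {
        addrs.push_back(base + i * stride_bytes);
    }
    return addrs;
}
```

    Under these assumptions, `cache_lines_touched(strided_warp(0, 4), 4)` gives 1 (contiguous, aligned 32-bit loads), while `cache_lines_touched(strided_warp(0, 128), 4)` gives 32, because every thread lands in a different line. Each extra line is an extra transaction that the instruction may have to be replayed for.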

    The latency to

    • L2 is 200-400 cycles
    • device memory (DRAM) is 400-800 cycles
    • zero-copy memory over PCIe is thousands of cycles

    The replay overhead increases because the higher latency leads to more misses and more contention for LSU resources.

    The global load efficiency does not increase because it is a ratio: the ideal amount of data that would need to be transferred for the memory instructions that were executed, divided by the amount of data actually transferred. "Ideal" means that the executed threads accessed sequential elements in memory starting at a cache line boundary (for a warp, a 32-bit operation is 1 cache line, a 64-bit operation is 2 cache lines, and a 128-bit operation is 4 cache lines). Accessing zero-copy memory is slower and less efficient, but it does not increase or otherwise change the amount of data transferred.
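
    The ratio described above can be written out as a small worked model. This is a sketch under stated assumptions (32-thread warps, 128-byte cache lines), not the profiler's implementation; `ideal_lines` and `load_efficiency` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kWarpSize = 32;        // threads per warp
constexpr std::uint64_t kCacheLineBytes = 128; // assumption: 128-byte lines

// Ideal number of cache lines for a fully active warp whose threads read
// sequential elements starting at a cache line boundary.
constexpr std::uint64_t ideal_lines(std::uint64_t access_bytes) {
    return (kWarpSize * access_bytes + kCacheLineBytes - 1) / kCacheLineBytes;
}

// Efficiency = bytes the threads asked for / bytes actually moved.
// lines_transferred is the number of cache lines the instruction touched.
double load_efficiency(std::uint64_t active_threads,
                       std::uint64_t access_bytes,
                       std::uint64_t lines_transferred) {
    assert(lines_transferred > 0);
    const double ideal_bytes =
        static_cast<double>(active_threads * access_bytes);
    const double actual_bytes =
        static_cast<double>(lines_transferred * kCacheLineBytes);
    return ideal_bytes / actual_bytes;
}
```

    With these definitions, `ideal_lines` returns 1, 2, and 4 for 4-, 8-, and 16-byte accesses, matching the counts above, and `load_efficiency(32, 4, 32)` is 0.03125 for a fully divergent 32-bit load. Nothing in the ratio depends on where the memory lives, which is why moving a buffer to zero-copy memory leaves the metric unchanged.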

    The profiler exposes the following counters:

    • gld_throughput
    • l1_cache_global_hit_rate
    • dram_{read, write}_throughput
    • l2_l1_read_hit_rate

    In the zero copy case all of these metrics should be much lower.

    The Nsight VSE CUDA Profiler memory experiments will show the amount of data accessed over PCIe (zero copy memory).
