Tools/Profiler

Submitted by Anonymous (unverified) on 2019-12-03 00:18:01

NVProfiler

  • Visual Profiler
  • nvprof

1.1. Focused Profiling
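Profiling requires no changes to the program, which suggests that it relies on GPU hardware counters and similar facilities rather than on anything in the program itself. You can still mark where profiling starts and stops to get more useful results. Profiling a fixed region like this suits several typical scenarios: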

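  • The code is split into initialization, data copy, kernel execution, data copy back, and result verification with post-processing, and the kernel is the part of interest. Profile only the kernel stage.
  • The program runs in phases that do not depend on each other, and each phase has its own kernels. Profile each phase separately.
  • The program runs many iterations and performance varies little from one iteration to the next. Profile a small subset of the iterations.

API:

  • cudaProfilerStart()/cudaProfilerStop(), declared in cuda_profiler_api.h (runtime API)
  • cuProfilerStart()/cuProfilerStop(), declared in cudaProfiler.h (driver API)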
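
As a minimal sketch of the first scenario, the function below brackets only the kernel stage with cudaProfilerStart()/cudaProfilerStop(). The kernel scale and the wrapper scale_profiled are hypothetical names invented for illustration, and error handling is reduced to asserts.

```cuda
#include <cassert>
#include <cuda_profiler_api.h>  // cudaProfilerStart / cudaProfilerStop
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel standing in for the "algorithm" stage.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Copy in -> kernel -> copy out, with profiling limited to the kernel stage.
std::vector<float> scale_profiled(std::vector<float> host, float factor) {
    const int n = static_cast<int>(host.size());
    if (n == 0) return host;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);
    assert(err == cudaSuccess);  // sketch-level error handling only
    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    assert(err == cudaSuccess);

    cudaProfilerStart();                      // region of interest begins
    scale<<<(n + 255) / 256, 256>>>(dev, n, factor);
    err = cudaDeviceSynchronize();            // wait for the kernel to finish
    cudaProfilerStop();                       // region of interest ends
    assert(err == cudaSuccess);

    err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    cudaFree(dev);
    return host;
}
```

When running under nvprof, passing --profile-from-start off keeps collection disabled until cudaProfilerStart() is reached. Without it, collection is already on from the beginning of the run.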


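1.2. Marking Regions of CPU Activity
The Visual Profiler shows how each CPU thread calls CUDA kernels. To see what the CPU threads are doing outside of their CUDA calls, instrument the application with the NVIDIA Tools Extension API (NVTX). nvprof supports NVTX as well.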
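
A small sketch of marking a CPU-side region with an NVTX push/pop range. The function cpu_checksum and the range label are hypothetical. Only nvtxRangePushA() and nvtxRangePop() come from NVTX.

```cuda
#include <numeric>
#include <nvToolsExt.h>  // NVTX range API
#include <vector>

// Hypothetical CPU-only stage; the NVTX calls make it visible on the timeline.
double cpu_checksum(const std::vector<double>& values) {
    nvtxRangePushA("cpu_checksum");  // open a named range on the calling thread
    double sum = std::accumulate(values.begin(), values.end(), 0.0);
    nvtxRangePop();                  // close the most recently opened range
    return sum;
}
```

Push/pop ranges nest per thread, so each nvtxRangePushA() needs a matching nvtxRangePop() on the same thread. With the toolkit versions of this period the program is typically linked with -lnvToolsExt.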
1.3. Naming CPU and CUDA Resources
You can use the NVIDIA Tools Extension API to assign custom names for your CPU and GPU resources. Your custom names will then be displayed in the Timeline View.
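
A sketch of naming CUDA resources, assuming the runtime API flavor of NVTX (nvToolsExtCudaRt.h). The helper make_named_stream and the name strings are hypothetical, and only CUDA resources are named here, not OS threads.

```cuda
#include <cuda_runtime.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>  // nvtxNameCudaDeviceA, nvtxNameCudaStreamA

// Creates a stream and attaches a custom name to it for the timeline.
// Returns nullptr if the stream could not be created.
cudaStream_t make_named_stream(const char* name) {
    cudaStream_t stream = nullptr;
    if (cudaStreamCreate(&stream) != cudaSuccess) return nullptr;
    nvtxNameCudaStreamA(stream, name);
    return stream;
}

// Names device 0; the label is an arbitrary example.
void name_default_device() {
    nvtxNameCudaDeviceA(0, "GPU 0 (compute)");
}
```

The caller owns the returned stream and should release it with cudaStreamDestroy() when done.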
1.4. Flush Profile Data
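By default, profile data is collected into internal buffers and written to disk at low priority. To avoid losing data that has not been written yet, call cuProfilerStop() (or cudaProfilerStop() with the runtime API) before all threads exit to force a flush.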
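
A minimal sketch of a flush step to call just before the application exits, written against the runtime API rather than the driver API call named above. The helper name flush_profile_data is hypothetical.

```cuda
#include <cuda_profiler_api.h>
#include <cuda_runtime.h>

// Waits for outstanding GPU work, then stops the profiler so that buffered
// profile data is flushed. Returns true only if both steps succeed.
bool flush_profile_data() {
    if (cudaDeviceSynchronize() != cudaSuccess) return false;
    return cudaProfilerStop() == cudaSuccess;
}
```

Synchronizing first ensures that the activity of all previously launched work has finished before the profiler is stopped.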

https://docs.nvidia.com/cuda/profiler-users-guide/index.html#profiling-overview
https://docs.nvidia.com/cuda/cupti/r_main.html#r_main

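The Visual Profiler is a graphical interface that displays the performance measurements of a program run. It is powerful and has many features, which you only really appreciate after downloading it and trying it once. Details: TODO.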

nvprof


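nvprof has a great many options (CUDA, CPU, print, I/O, and so on), and several execution modes and control modes can be specified. Details: TODO. Either try each option in turn, or look them up when the need arises.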

Remote Profiling

You can profile your remote application directly from Nsight or the Visual Profiler.
Or you can use nvprof to collect the profile data on the remote system and then use nvvp on the host system to view and analyze the data.
TODO: try running this once.

NVIDIA Tools Extension

NVTX provides an API that serves two purposes:

  • Tracing of CPU events and time ranges.
  • Naming of OS and CUDA resources.

TODO

MPI Profiling With nvprof

MPI programs can also be profiled with nvprof.

TODO

MPS Profiling

You can collect profiling data for a CUDA application using Multi-Process Service (MPS) with nvprof and then view the timeline by importing the data in the Visual Profiler.
TODO

Dependency Analysis

I did not fully understand this one. Roughly, the tool analyzes the dependencies between different segments of a program.
TODO

Metrics Reference

These are performance metrics computed from hardware event counters. Look them up as the situation requires.

Warp State

  • Instruction issued - An instruction or a pair of independent instructions was issued from a warp.
  • Stalled - Warp can be stalled for one of the following reasons.
    • Stalled for instruction fetch - The next instruction was not yet available. The stall comes from the instruction cache.
    • Stalled for execution dependency - A register the next instruction depends on is not ready yet because of an earlier compute instruction (for example FP64) or a barrier. Try to increase instruction-level parallelism (ILP).
    • Stalled for memory dependency - The next instruction is waiting for a previous memory access to complete. A register it depends on is not ready yet because of an earlier memory instruction such as a load (LD).
    • Stalled for memory throttle - A large number of outstanding memory requests prevents forward progress. This is a bandwidth limit: both global and shared memory have finite bandwidth.
    • Stalled for texture
    • Stalled for sync - The warp is waiting for all threads to synchronize after a barrier instruction.
    • Stalled for constant memory dependency - Caused by accesses to constant memory.
    • Stalled for pipe busy - The warp is stalled because the functional unit required to execute the next instruction is busy. FP64 instructions, for example, can keep the pipe busy.
    • Stalled for not selected - Warp was ready but did not get a chance to issue as some other warp was selected for issue. This is typical of a well-optimized program.
    • Stalled for other - Warp is blocked for an uncommon reason like compiler or hardware reasons. Note: barrier > 18, stall pipeline.