nvprof not picking up any API calls or kernels

Posted by 為{幸葍}努か on 2019-12-11 12:16:05

Question


I'm trying to get some benchmark timings for my CUDA program with nvprof, but unfortunately it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right and found one on the Nvidia dev blog here:

https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

Code:

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    return 0;
}

Command line:

-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test

So I replicated it word for word, line by line, and ran it with identical command-line arguments. Unfortunately, my result was the same:

-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.

==85454== API calls:
No API activities were profiled. 

I am running CUDA Toolkit 7.5.

If anyone knows what I'm doing wrong, I'd be grateful to know the answer.

-----EDIT-----

So I modified the code to be:

#include<cuda_profiler_api.h>

int main()
{
    cudaProfilerStart();
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);
    int *h_a = (int*)malloc(bytes);
    int *d_a;
    cudaMalloc((int**)&d_a, bytes);

    memset(h_a, 0, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);

    cudaProfilerStop();
    return 0;
}

Unfortunately, this did not change anything.


Answer 1:


You need to call cudaProfilerStop() (for the Runtime API) before the thread exits. This allows nvprof to collect all the necessary data.

According to the CUDA documentation:

To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
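
As a minimal sketch of what the documentation describes, the program from the question could synchronize and then stop the profiler just before exiting. The CHECK_CUDA macro is a hypothetical helper that does not appear in the original post. It is added here on the assumption that checking return codes is useful while debugging, since a failing cudaMalloc or cudaMemcpy would otherwise go unnoticed. It is not claimed to be the fix for the asker's problem.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical helper (not from the original post): print the CUDA error
// string and exit if a runtime API call does not return cudaSuccess.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

int main()
{
    const unsigned int N = 1048576;
    const unsigned int bytes = N * sizeof(int);

    int *h_a = (int*)malloc(bytes);
    if (h_a == NULL) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }

    int *d_a = NULL;
    CHECK_CUDA(cudaMalloc((void**)&d_a, bytes));

    memset(h_a, 0, bytes);
    CHECK_CUDA(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost));

    CHECK_CUDA(cudaFree(d_a));
    free(h_a);

    // Wait for all outstanding GPU work, then flush buffered profile data.
    CHECK_CUDA(cudaDeviceSynchronize());
    CHECK_CUDA(cudaProfilerStop());

    return 0;
}
```

It can be built and profiled with the same commands as in the question (nvcc profile.cu -o profile_test, then nvprof ./profile_test). If any call fails, the macro prints the error string to stderr, which may help tell a runtime failure apart from a profiler issue.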




Answer 2:


It's a bug in unified memory profiling. Running nvprof with the flag

nvprof --unified-memory-profiling off ./profile_test

resolves all the problems for me.



Source: https://stackoverflow.com/questions/36970646/nvprof-not-picking-up-any-api-calls-or-kernels
