Question
I'm trying to get some benchmark timings in my CUDA program with nvprof, but unfortunately it doesn't seem to be profiling any API calls or kernels. I looked for a simple beginner's example to make sure I was doing it right and found one on the Nvidia dev blogs here:
https://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/
Code:
int main()
{
const unsigned int N = 1048576;
const unsigned int bytes = N * sizeof(int);
int *h_a = (int*)malloc(bytes);
int *d_a;
cudaMalloc((int**)&d_a, bytes);
memset(h_a, 0, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
return 0;
}
Command line:
-bash-4.2$ nvcc profile.cu -o profile_test
-bash-4.2$ nvprof ./profile_test
So I replicated it word for word, line by line, and ran identical command line arguments. Unfortunately my result was the same:
-bash-4.2$ nvprof ./profile_test
==85454== NVPROF is profiling process 85454, command: ./profile_test
==85454== Profiling application: ./profile_test
==85454== Profiling result:
No kernels were profiled.
==85454== API calls:
No API activities were profiled.
I am running Nvidia toolkit 7.5
If anyone knows what I'm doing wrong, I'd be grateful to know the answer.
-----EDIT-----
So I modified the code to be
#include<cuda_profiler_api.h>
int main()
{
cudaProfilerStart();
const unsigned int N = 1048576;
const unsigned int bytes = N * sizeof(int);
int *h_a = (int*)malloc(bytes);
int *d_a;
cudaMalloc((int**)&d_a, bytes);
memset(h_a, 0, bytes);
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
cudaProfilerStop();
return 0;
}
Unfortunately it did not change things.
Answer 1:
You need to call cudaProfilerStop() (for the Runtime API) before the thread exits. This allows nvprof to collect all necessary data.
According to the CUDA documentation:
To avoid losing profile information that has not yet been flushed, the application being profiled should make sure, before exiting, that all GPU work is done (using CUDA synchronization calls), and then call cudaProfilerStop() or cuProfilerStop(). Doing so forces buffered profile information on corresponding context(s) to be flushed.
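As a rough sketch of the pattern the documentation describes, the test program could synchronize and then stop the profiler just before exiting. The helper names `ok` and `runProfiledCopy` are made up for illustration and are not part of the original post or of the CUDA API. The error checks are an addition as well: they are there so that a failing runtime call shows up on stderr instead of passing silently. Note that the asker's edit already tried cudaProfilerStop() without success, so this may not be sufficient on its own.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Hypothetical helper (not from the original post): report a failed CUDA call.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper: copy n ints to the device and back.
// Returns 0 on success, 1 if any step failed.
int runProfiledCopy(unsigned int n)
{
    const size_t bytes = n * sizeof(int);
    int *h_a = (int *)malloc(bytes);
    if (h_a == NULL) return 1;
    memset(h_a, 0, bytes);

    int *d_a = NULL;
    bool good = ok(cudaMalloc((void **)&d_a, bytes), "cudaMalloc")
             && ok(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D")
             && ok(cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");

    // Make sure all GPU work is done before the profiler data is flushed.
    good = ok(cudaDeviceSynchronize(), "cudaDeviceSynchronize") && good;

    cudaFree(d_a);  // cudaFree on a null pointer is a no-op
    free(h_a);
    return good ? 0 : 1;
}

int main()
{
    int status = runProfiledCopy(1048576);
    // Flush buffered profile information before the process exits.
    if (!ok(cudaProfilerStop(), "cudaProfilerStop")) status = 1;
    return status;
}
```

Build and run it the same way as in the question (nvcc profile.cu -o profile_test, then nvprof ./profile_test). If any call prints an error, the problem lies with the CUDA setup rather than with nvprof.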
Answer 2:
It's a bug with unified memory profiling. Running nvprof with unified memory profiling turned off,
nvprof --unified-memory-profiling off ./profile_test
resolves all problems for me.
Source: https://stackoverflow.com/questions/36970646/nvprof-not-picking-up-any-api-calls-or-kernels