I am using the following two functions to time different parts (cudaMemcpyHtoD, kernel execution, cudaMemcpyDtoH) of my code (which includes multi-gpus, concurrent kernels on sa
First, if this is for production code, you may want to be able to do something between the second cudaEventRecord and cudaEventSynchronize(). Otherwise, this could reduce the ability of your app to overlap GPU and CPU work.
Next, I would separate event creation and destruction from event recording. I'm not sure of the cost, but in general you might not want to call cudaEventCreate and cudaEventDestroy often.
What I would do is create a class like this
class EventTimer {
public:
EventTimer() : mStarted(false), mStopped(false) {
cudaEventCreate(&mStart);
cudaEventCreate(&mStop);
}
~EventTimer() {
cudaEventDestroy(mStart);
cudaEventDestroy(mStop);
}
void start(cudaStream_t s = 0) { cudaEventRecord(mStart, s);
mStarted = true; mStopped = false; }
void stop(cudaStream_t s = 0) { assert(mStarted);
cudaEventRecord(mStop, s);
mStarted = false; mStopped = true; }
float elapsed() {
assert(mStopped);
if (!mStopped) return 0;
cudaEventSynchronize(mStop);
float elapsed = 0;
cudaEventElapsedTime(&elapsed, mStart, mStop);
return elapsed;
}
private:
bool mStarted, mStopped;
cudaEvent_t mStart, mStop;
};
Note I didn't include cudaSetDevice() -- seems to me that should be left to the code that uses this class, to make it more flexible. The user would have to ensure the same device is active when start and stop are called.
PS: It is not NVIDIA's intent for CUTIL to be relied upon for production code -- it is used simply for convenience in our examples and is not as rigorously tested or optimized as the CUDA libraries and compilers themselves. I recommend you extract things like cutilSafeCall() into your own libraries and headers.