I have a CUDA kernel that calls out to a series of device functions.
What is the best way to get the execution time for each of the device functions?
In my own code, I use the clock() function to get precise timings. For convenience, I have defined the following macros:
enum {
    tid_this = 0,
    tid_that,
    tid_count
};
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
#define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
#define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
#define TIMER_TIC
#define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the result in the "tid_this" counter. */
    TIMER_TOC( tid_this );

}
You can then read the cuda_timers in the host code.
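If it helps, a minimal host-side sketch could look like the following (assuming the enum, cuda_timers, and mykernel above; nr_blocks, nr_threads, and the kernel arguments are placeholders):
float timers_host[ tid_count ] = { 0.0f };

/* Zero the device-side counters before launching the instrumented kernel. */
cudaMemcpyToSymbol( cuda_timers , timers_host , sizeof(float) * tid_count );

/* Launch the instrumented kernel (placeholder configuration). */
mykernel<<< nr_blocks , nr_threads >>>( ... );
cudaDeviceSynchronize();

/* Copy the accumulated clock counts back to the host. */
cudaMemcpyFromSymbol( timers_host , cuda_timers , sizeof(float) * tid_count );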
A few notes:
- The timers are wrapped in #ifdef USETIMERS so you can switch them off easily.
- Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks). A rough conversion of these counts to milliseconds is sketched after these notes.
- The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
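For reference, here is a rough sketch of converting the accumulated counts to milliseconds, assuming timers_host was filled as in the host-side sketch above. cudaGetDeviceProperties reports clockRate in kHz, so dividing cycles by clockRate gives milliseconds; keep in mind the result is summed over all blocks, not wall-clock time:
cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );

/* prop.clockRate is in kHz, so cycles / clockRate gives milliseconds. */
printf( "tid_this: %.3f ms (summed over all blocks)\n" ,
        timers_host[ tid_this ] / prop.clockRate );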
P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.