Timing different sections in CUDA kernel

前端未结

关注

 1  1621

时光说笑

I have a CUDA kernel that calls out to a series of device functions.

What is the best way to get the execution time for each of the device functions?

What is

相关标签:

1条回答

温柔的废话

2020-12-31 12:56
In my own code, I use the clock() function to get precise timings. For convenience, I have the macros
```
enum {
    tid_this = 0,
    tid_that,
    tid_count
    };
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif
```
These can then be used to instrument the device code as follows:
```
__global__ mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the results to the "timer_this" counter. */
    TIMER_TOC( tid_this );

    }
```
You can then read the cuda_timers in the host code.

A few notes:
- The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
- Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
- The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
- The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
- Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
- The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) ) is necessary in case the clock counter wraps around.
P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.
0 讨论(0)
发布评论:

提交评论
- 加载中...