Timing different sections in CUDA kernel

前端 未结 1 1621
时光说笑
时光说笑 2020-12-31 12:32

I have a CUDA kernel that calls out to a series of device functions.

What is the best way to get the execution time for each of the device functions?

What is

相关标签:
1条回答
  • 2020-12-31 12:56

    In my own code, I use the clock() function to get precise timings. For convenience, I have the macros

    enum {
        tid_this = 0,
        tid_that,
        tid_count
        };
    __device__ float cuda_timers[ tid_count ];
    #ifdef USETIMERS
     #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
     #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
    #else
     #define TIMER_TIC
     #define TIMER_TOC(tid)
    #endif
    

    These can then be used to instrument the device code as follows:

    __global__ mykernel ( ... ) {
    
        /* Start the timer. */
        TIMER_TIC
    
        /* Do stuff. */
        ...
    
        /* Stop the timer and store the results to the "timer_this" counter. */
        TIMER_TOC( tid_this );
    
        }
    

    You can then read the cuda_timers in the host code.

    A few notes:

    • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
    • Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
    • The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.
    • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
    • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
    • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) ) is necessary in case the clock counter wraps around.

    P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.

    0 讨论(0)
提交回复
热议问题