Why integer division and modulo isn't optimized out in NVRTC

问题

I compiled a kernel in NVRTC:

__global__ void kernel_A(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx / 32;
    unsigned char lane_id = idx % 32;
    /* ... */
}

I know integer division and modulo are very costly on CUDA GPUs. However I thought this kind of division-by-power-of-2 should be optimized into bit operations, until I found it isn't:

__global__ void kernel_B(/* args */) {
    unsigned short idx = threadIdx.x;
    unsigned char warp_id = idx >> 5;
    unsigned char lane_id = idx & 31;
    /* ... */
}

it seems kernel_B just runs faster. When omitting all other codes in kernel, launching with 1024 blocks of size 1024, nvprof shows kernel_A runs for 15.2us in average, while kernel_B runs 7.4us in average. I speculate NVRTC did not optimize out the integer division and modulo.

The result is obtained on a GeForce 750 Ti, CUDA 8.0, averaged from 100 calls. The compiler options given to nvrtcCompileProgram() is -arch compute_50.

Is this expected?

回答1:

Did a thorough bugsweep in the codebase. Turns out my app was built in DEBUG mode. This causes additional flags -G and -lineinfo passed to nvrtcCompileProgram()

From nvcc man page:

--device-debug (-G)

Generate debug information for device code. Turns off all optimizations. Don't use for profiling; use -lineinfo instead.

来源：https://stackoverflow.com/questions/44300294/why-integer-division-and-modulo-isnt-optimized-out-in-nvrtc

标签

optimization

cuda

nvrtc

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!