CUDA: warp divergence overhead vs. extra arithmetic
Question

Of course, warp divergence, via if and switch statements, is to be avoided at all costs on GPUs. But what is the overhead of warp divergence (scheduling only some of the threads to execute certain lines) vs. additional useless arithmetic?

Consider the following dummy example:

Version 1:

    __device__ int get_D (int A, int B, int C)
    {
        // The value A is potentially different for every thread.
        int D = 0;

        if (A < 10)
            D = A*6;
        else if (A < 17)
            D = A*6 + B*2;
        else if (A < 26)
            D = A*6 + B*2 + C;
        else D