Here is some simple CUDA code.
I am testing the time it takes to access global memory, for both reads and writes.
Below is the kernel function (test1()).
__glo
When you delete the code line:
direct_map[index] = -1;
your kernel no longer does anything useful. With that store removed, the kernel does not modify any global state, so from the compiler's perspective the remaining code has no observable effect. The compiler can recognize this and eliminate most of the code associated with the kernel.
You can verify this by dumping the assembly code (SASS) that the compiler generates in each case, for example with: cuobjdump -sass myexecutable
Any time you make a small change to the code and see a large change in timing, you should suspect that the change has allowed the compiler to make different optimization decisions.
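To make the effect concrete, here is a minimal sketch of the comparison. It is not the original test1(), which is not shown in full in the question. The kernel names with_store and without_store, the helper time_kernel, the src array, and the problem size are all assumptions made up for illustration. The first kernel keeps the store to direct_map, the second has it deleted, and each is timed with CUDA events.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for test1(): reads global memory, then stores to direct_map.
__global__ void with_store(const int *src, int *direct_map, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n && src[index] >= 0)
        direct_map[index] = -1;   // visible side effect, so the load must be kept
}

// Same idea with the store deleted: nothing the kernel computes is observable.
__global__ void without_store(const int *src, int *direct_map, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {
        int v = src[index];       // value is never used, so the load is dead code
        (void)v;
    }
}

typedef void (*kernel_t)(const int *, int *, int);

// Time a single launch with CUDA events; returns elapsed milliseconds.
float time_kernel(kernel_t kernel, const int *src, int *direct_map, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int block = 256;
    int grid = (n + block - 1) / block;
    cudaEventRecord(start);
    kernel<<<grid, block>>>(src, direct_map, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel and the stop event finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 24;        // arbitrary size chosen for this sketch
    int *src = nullptr, *direct_map = nullptr;
    cudaMalloc(&src, n * sizeof(int));
    cudaMalloc(&direct_map, n * sizeof(int));
    cudaMemset(src, 0, n * sizeof(int));
    cudaMemset(direct_map, 0, n * sizeof(int));

    time_kernel(with_store, src, direct_map, n);   // warm-up launch, not reported
    printf("with store:    %f ms\n", time_kernel(with_store, src, direct_map, n));
    printf("without store: %f ms\n", time_kernel(without_store, src, direct_map, n));

    // Sanity check: the store in with_store really did change global state.
    int first = 0;
    cudaMemcpy(&first, direct_map, sizeof(int), cudaMemcpyDeviceToHost);
    assert(first == -1);

    cudaFree(src);
    cudaFree(direct_map);
    return 0;
}
```

The exact timings depend on the GPU, driver, and compiler version, so no numbers are quoted here. If the compiler does eliminate the dead load, the SASS for without_store should be noticeably shorter than the SASS for with_store, which is what the cuobjdump check is meant to reveal.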