I have a kernel that calls a device function inside an if statement. The code is as follows:
__device__ void SetValues(int *ptr,int id)
{
if
The CUDA compiler inlines device functions by default (although Fermi and newer architectures also support a proper ABI with function pointers and real function calls). So your example code gets compiled to something like this:
__global__ void Kernel(int *ptr)
{
    if(threadIdx.x<2)
        if(ptr[threadIdx.x]==threadIdx.x)
            ptr[threadIdx.x]++;
}
Execution happens in parallel, just like normal code. If you engineer a memory race into a function, there is no serialization mechanism that can save you.
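To make the last point concrete, here is a minimal sketch of what a race inside a device function looks like and one way to avoid it. The functions `CountRacy` and `CountAtomic`, and the two-counter layout, are hypothetical and not taken from your code; they only illustrate that inlining does not change the memory semantics.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device function: every calling thread updates the SAME address.
// The plain increment is an unsynchronized read-modify-write, so the final
// value is unspecified whether or not the function is inlined.
__device__ void CountRacy(int *counter)
{
    (*counter)++;
}

// Same update, but atomicAdd makes each read-modify-write indivisible.
__device__ void CountAtomic(int *counter)
{
    atomicAdd(counter, 1);
}

__global__ void Kernel(int *racy, int *safe)
{
    // Only threads 0 and 1 take the branch, as in the kernel above.
    if (threadIdx.x < 2) {
        CountRacy(racy);
        CountAtomic(safe);
    }
}

int main()
{
    int *d_counters = nullptr;
    cudaMalloc(&d_counters, 2 * sizeof(int));
    cudaMemset(d_counters, 0, 2 * sizeof(int));

    Kernel<<<1, 32>>>(d_counters, d_counters + 1);
    cudaDeviceSynchronize();

    int h_counters[2] = {0, 0};
    cudaMemcpy(h_counters, d_counters, sizeof(h_counters), cudaMemcpyDeviceToHost);
    printf("racy = %d (unspecified), atomic = %d\n", h_counters[0], h_counters[1]);

    cudaFree(d_counters);
    return 0;
}
```

Assuming the launch succeeds, the atomic counter ends at 2 because both increments are applied one after the other. The racy counter carries no such guarantee. Note that your original pattern, where each thread touches only `ptr[threadIdx.x]`, has no race at all, so it needs no atomics.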