The issue is that you defined a __device__ function in a separate compilation unit from the __global__ function that calls it. You need to either explicitly enable relocatable device code mode by adding the -dc flag, or move the definition into the same compilation unit as the kernel.
From the nvcc documentation:
--device-c|-dc
Compile each .c/.cc/.cpp/.cxx/.cu input file into an object file that contains relocatable device code. It is equivalent to --relocatable-device-code=true --compile.
See Separate Compilation and Linking of CUDA C++ Device Code for more information.
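As a rough illustration of the first option, here is a minimal sketch of a project split across two compilation units. The file names (square.cuh, square.cu, main.cu) and the function names (square, kernel) are hypothetical and not taken from your code, and error checking is omitted for brevity. The build commands are in the trailing comment.

```cuda
// ---- square.cuh (hypothetical header) ----
#pragma once
// Declaration only; the definition lives in square.cu.
__device__ float square(float x);

// ---- square.cu (hypothetical) ----
#include "square.cuh"
// Device function defined in its own compilation unit.
__device__ float square(float x) { return x * x; }

// ---- main.cu (hypothetical) ----
#include <cstdio>
#include "square.cuh"

// The kernel calls a __device__ function defined in another unit,
// which only links when relocatable device code is enabled.
__global__ void kernel(float *out) { *out = square(3.0f); }

int main() {
    float *d_out = nullptr;
    float h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    kernel<<<1, 1>>>(d_out);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    printf("%f\n", h_out);
    return 0;
}

// Build with relocatable device code, then link with nvcc so that
// the device link step is performed as well:
//   nvcc -dc square.cu -o square.o
//   nvcc -dc main.cu -o main.o
//   nvcc square.o main.o -o app
```

If you would rather not use separate compilation, the second option amounts to moving the body of the device function into the header (or into main.cu) so that the kernel and the function end up in the same compilation unit; in that case the plain nvcc main.cu -o app is enough.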