Suppose I have
template void foo(float* data, size_t length);
and I want to compile one instantiation with
As of CUDA 7.5 (the latest version I am familiar with, although CUDA 8.0 is currently shipping), nvcc does not support function attributes that allow programmers to apply specific compiler optimizations on a per-function basis.
Since optimization configurations set via command line switches apply to the entire compilation unit, one possible approach is to use as many different compilation units as there are different optimization configurations, as already noted in the question; source code may be shared and #include-ed from a common file.
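A minimal sketch of that layout, under some assumptions: the file names, the FOO_NAME macro, the foo_default fallback, and the kernel body are all hypothetical and only stand in for whatever foo actually does. The shared file is included once per compilation unit, and each unit is built with its own switches.

```cuda
// foo_impl.cuh (hypothetical name): shared source, included once per configuration.
//
// foo_fast.cu:     #define FOO_NAME foo_fast
//                  #include "foo_impl.cuh"
// foo_precise.cu:  #define FOO_NAME foo_precise
//                  #include "foo_impl.cuh"
//
// Build each unit with its own switches, then link the objects together:
//   nvcc -c --use_fast_math foo_fast.cu    -o foo_fast.o
//   nvcc -c                 foo_precise.cu -o foo_precise.o
#include <cstddef>
#include <cuda_runtime.h>

#ifndef FOO_NAME
#define FOO_NAME foo_default  // fallback so this file also compiles on its own
#endif

// Internal linkage: every unit gets its own copy of the kernel,
// compiled with that unit's optimization switches.
static __global__ void foo_kernel(float* data, size_t length)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < length) {
        float x = data[i];
        // Placeholder arithmetic; sinf and the division are both
        // affected by --use_fast_math.
        data[i] = sinf(x) / (1.0f + x * x);
    }
}

// Host entry point with a per-configuration name.
// data must point to device memory.
void FOO_NAME(float* data, size_t length)
{
    if (length == 0) return;
    const unsigned int threads = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((length + threads - 1) / threads);
    foo_kernel<<<blocks, threads>>>(data, length);
}
```

The caller then declares foo_fast and foo_precise as ordinary host functions and picks one at run time. Keeping the kernel static avoids a name clash between the two objects at link time.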
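With nvcc, the command line switch --use_fast_math basically controls three areas of functionality: flush-to-zero handling of single-precision denormals (-ftz=true), the use of approximate single-precision division, reciprocal, and square root (-prec-div=false, -prec-sqrt=false), and the replacement of certain standard math functions with lower-precision intrinsics (for example, sinf with __sinf).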
You can apply some of these changes with per-operation granularity by using appropriate intrinsics, others by using PTX inline assembly.
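As a rough illustration of per-operation control, the sketch below mixes a standard math call with a fast intrinsic and an inline PTX instruction inside one kernel. The names rcp_approx_ftz, mixed_kernel, and run_mixed, as well as the arithmetic itself, are made up for this example; __sinf is a documented CUDA intrinsic and rcp.approx.ftz.f32 is a documented PTX instruction.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Approximate reciprocal with denormals flushed to zero, requested
// explicitly through inline PTX instead of a compiler switch.
__device__ __forceinline__ float rcp_approx_ftz(float x)
{
    float r;
    asm("rcp.approx.ftz.f32 %0, %1;" : "=f"(r) : "f"(x));
    return r;
}

__global__ void mixed_kernel(const float* in, float* precise, float* fast, size_t n)
{
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) {
        float x = in[i];
        float d = 1.0f + x * x;
        // Standard functions: their accuracy follows the switches
        // used for this compilation unit.
        precise[i] = sinf(x) / d;
        // Intrinsic plus PTX: fast path regardless of the switches.
        fast[i] = __sinf(x) * rcp_approx_ftz(d);
    }
}

// Hypothetical host helper: runs the kernel on n host values and
// copies both result arrays back. Returns cudaSuccess on success.
cudaError_t run_mixed(const float* h_in, float* h_precise, float* h_fast, size_t n)
{
    if (n == 0) return cudaSuccess;
    float* d_buf = nullptr;  // one allocation holding in, precise, fast
    const size_t bytes = n * sizeof(float);
    cudaError_t err = cudaMalloc(&d_buf, 3 * bytes);
    if (err != cudaSuccess) return err;
    float* d_in = d_buf;
    float* d_precise = d_buf + n;
    float* d_fast = d_buf + 2 * n;

    err = cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const unsigned int threads = 256;
        const unsigned int blocks =
            static_cast<unsigned int>((n + threads - 1) / threads);
        mixed_kernel<<<blocks, threads>>>(d_in, d_precise, d_fast, n);
        err = cudaDeviceSynchronize();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_precise, d_precise, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_fast, d_fast, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    return err;
}
```

Compiled without --use_fast_math, the two output arrays let you compare the accurate and the fast path for the same inputs, which is a cheap way to check whether the reduced accuracy is acceptable before committing to it.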