问题
I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I identified a specific __device__
function whose performances can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__
function (not a __global__
), in order to "override" an existing function?
回答1:
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
回答2:
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using nvrtc. You'd need to go about doing something like this:
- Instead of compiling the main kernel ahead of time, store it as as string to be compiled and linked at runtime.
- Let's say the main kernel calls "myFunc" which is a device kernel that is chosen at runtime.
- You can generate the appropriate "myFunc" kernel based on equations at run time.
- Now you can create an nvrtc program using multiple sources using nvrtcCreateProgram.
That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.
There is one problem I foresee. nvrtc may not find the curand device calls which may cause some issues. One work around would be to look at the header the device function call is in and use nvcc to compile the appropriate device kernel to ptx. You can store the resulting ptx as text and use cuLinkAddData to link with your module. You can find more information in this section.
来源:https://stackoverflow.com/questions/38856078/nvrtc-and-device-functions