I'm wondering what the overhead of performing a CUDA kernel call is in C/C++, such as the following:

    somekernel1<<<blocks, threads>>>(args);
    somekernel2<<<blocks, threads>>>(args);
The host-side overhead of a kernel launch using the runtime API is only about 15-30 microseconds on non-WDDM Windows platforms. On WDDM platforms (which I don't use), I understand it can be much, much higher, plus there is some sort of batching mechanism in the driver which tries to amortise the cost by doing multiple operations in a single driver-side operation.
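If you want a rough number for your own machine, one way is to time a large batch of launches of an empty kernel from the host. This is only a minimal sketch, assuming the runtime API and an nvcc-compiled .cu file; the kernel name, launch count and configuration are arbitrary choices for illustration:

    // Rough estimate of host-side launch overhead: time many asynchronous
    // launches of a kernel that does no work, then average.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void empty_kernel() {}

    int main()
    {
        const int launches = 10000;

        // Warm up: the first launch absorbs one-off context initialisation costs.
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();

        auto start = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < launches; ++i)
            empty_kernel<<<1, 1>>>();   // asynchronous, so mostly host-side cost is measured
        auto stop = std::chrono::high_resolution_clock::now();
        cudaDeviceSynchronize();

        double us = std::chrono::duration<double, std::micro>(stop - start).count();
        std::printf("average host-side launch overhead: %.2f us\n", us / launches);
        return 0;
    }

The number you get depends on driver, OS and hardware, so treat it as an order-of-magnitude check rather than a precise figure.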
Generally, there will be a performance increase from "fusing" multiple data operations, which would otherwise be done in separate kernels, into a single kernel where the algorithms allow it. The GPU has much higher peak arithmetic throughput than peak memory bandwidth, so the more FLOPs that can be executed per memory transaction (and per kernel "setup code"), the better the kernel's performance will be. On the other hand, trying to write a "Swiss army knife" style kernel which crams completely disparate operations into a single piece of code is never a particularly good idea, because it increases register pressure and reduces the efficiency of things like the L1, constant memory and texture caches.
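As a rough illustration of what fusing means here (a sketch with hypothetical kernel and argument names, not code from any particular application): two separate element-wise passes over the same array versus one fused pass:

    // Unfused: two kernels, each doing one read and one write per element.
    __global__ void scale_kernel(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = a * x[i];
    }

    __global__ void add_kernel(float *x, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] + b;
    }

    // Fused: same arithmetic, but half the global memory traffic and a
    // single launch overhead instead of two.
    __global__ void scale_add_kernel(float *x, float a, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = a * x[i] + b;
    }

The fused kernel performs the same FLOPs but touches global memory once per element rather than twice, which is usually where the win comes from, since bandwidth rather than arithmetic is the typical bottleneck for this kind of code.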
Which way you choose to go should really be guided by the nature of the code/algorithms. I don't believe there is a single "correct" answer to this question that can be applied in all circumstances.