My CUDA kernel is 50% faster than a Vulkan compute shader running the same code. Nvidia's backend compiler generates suboptimal instructions. How can I fix this issue?