CUDA performance penalty when running in Windows

谎友^ 2020-12-30 09:09

I've noticed a big performance hit when I run my CUDA application in Windows 7 (versus Linux). I think I may know where the slowdown occurs: For whatever reason, the Win

2 Answers
  • 2020-12-30 09:50

    Even though almost 3 years have passed since this issue was active, I still consider it necessary to share my findings.

    I've been in the same situation: the same CUDA program took 5 ms on Ubuntu with CUDA 8.0 but over 30 ms on Windows 10 with CUDA 10.1, both on a GTX 1080 Ti. However, on Windows, when I switched from building in Visual Studio to invoking nvcc from the command line (cmd), the program suddenly ran as fast as the Linux one.

    This suggests that maybe the problem comes from Visual Studio.

  • 2020-12-30 10:02

    There is a fair amount of overhead in sending GPU hardware commands through the WDDM stack.

    As you've discovered, this means that under WDDM (only) GPU commands can get "batched" to amortize this overhead. The batching process may (probably will) introduce some latency, which can be variable, depending on what else is going on.

    The best solution under Windows is to switch the operating mode of the GPU from WDDM to TCC, which can be done via the nvidia-smi command, but it is only supported on Tesla GPUs and certain members of the Quadro family of GPUs -- i.e. not GeForce. (It also has the side effect of preventing the device from being used as a Windows accelerated display adapter, which might be relevant for a Quadro device or a few specific older Fermi Tesla GPUs.)

    AFAIK there is no officially documented method to circumvent or affect the WDDM batching process in the driver. Unofficially, according to Greg@NV in this link, the command to issue after the CUDA kernel call is cudaEventQuery(0); which may/should cause the WDDM batch queue to "flush" to the GPU.

    As Greg points out, extensive use of this mechanism will wipe out the amortization benefit, and may do more harm than good.

    EDIT: moving forward to 2016, a newer recommendation for a "low-impact" flush of the WDDM command queue would be cudaStreamQuery(stream);
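
    As a rough illustration of the flush described above, the sketch below launches a kernel into a stream and then immediately calls cudaStreamQuery(stream). The kernel `scale`, the buffer size, and the launch configuration are hypothetical and only there to make the example self-contained. Whether the query actually triggers a WDDM flush is driver behavior that is not officially documented, so treat this as an experiment to time, not a guarantee.

    ```cuda
    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, used only for illustration.
    __global__ void scale(float *data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;  // assumed problem size
        float *d_data = nullptr;
        cudaStream_t stream;

        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));
        cudaStreamCreate(&stream);

        // Under WDDM this launch may sit in the driver's batch queue for a while.
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, 2.0f, n);

        // Non-blocking query. It returns cudaSuccess if the stream is idle or
        // cudaErrorNotReady if work is still pending; both are fine here.
        // The hoped-for side effect is that the batched commands get submitted.
        cudaError_t status = cudaStreamQuery(stream);
        if (status != cudaSuccess && status != cudaErrorNotReady) {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status));
            return 1;
        }

        // ... independent CPU work could overlap with the kernel here ...

        cudaStreamSynchronize(stream);  // wait for the kernel to finish
        cudaStreamDestroy(stream);
        cudaFree(d_data);
        printf("done\n");
        return 0;
    }
    ```

    Calling the query after every launch gives up the amortization that batching provides, so it is probably best reserved for launches where latency matters.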

    EDIT2: Using recent drivers on Windows, you should be able to place Titan family GPUs in TCC mode, assuming you have some other GPU set up for primary display. The nvidia-smi tool will allow you to switch modes (use nvidia-smi --help for more info).

    Additional info about the TCC driver model can be found in the Windows install guide, including that it may reduce the latency of kernel launches.
