Dealing with large switch statements in CUDA

Question

I understand that branching in CUDA is not recommended as it can adversely affect performance. In my work, I find myself having to implement large switch statements that contain upward of a few dozen cases.

Does anyone have any idea how badly this will affect performance? (The official documentation isn't very specific.) Also, does anyone have a more efficient way of handling this?

Answer 1:

A good way to avoid large switch statements is to implement a function table and select the function from the table by an index derived from your switch condition. CUDA allows you to use function pointers to __device__ functions in kernels.
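A minimal sketch of the function-table idea, assuming three made-up operations (op_add_one, op_double, op_negate) in place of the real switch cases. The kernel name apply_ops and the host helper run_apply_ops are hypothetical, and error checking is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical operations standing in for the bodies of the switch cases.
__device__ float op_add_one(float x) { return x + 1.0f; }
__device__ float op_double(float x)  { return x * 2.0f; }
__device__ float op_negate(float x)  { return -x; }

typedef float (*op_fn)(float);

// Function table in device memory, indexed by the value the switch would test.
__device__ op_fn op_table[] = { op_add_one, op_double, op_negate };

// Each thread looks up its function by index instead of running a switch.
// Assumes every op[i] is a valid index into op_table.
__global__ void apply_ops(const int* op, const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = op_table[op[i]](in[i]);
    }
}

// Hypothetical host helper: copies inputs to the GPU, launches the kernel,
// and returns the results. CUDA error checking is omitted for brevity.
std::vector<float> run_apply_ops(const std::vector<int>& op,
                                 const std::vector<float>& in) {
    assert(op.size() == in.size());
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    int* d_op = nullptr;
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_op, n * sizeof(int));
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_op, op.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_in, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    apply_ops<<<blocks, threads>>>(d_op, d_in, d_out, n);

    // This copy runs on the default stream, so it waits for the kernel.
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_op);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling run_apply_ops from main with one op code per input element should apply the matching function to each element. This tidies the source, but threads in a warp that pick different table entries still execute one after another, so it is not expected to remove the divergence cost by itself.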

Answer 2:

The GPU runs threads in groups of 32, called warps. Whenever different threads in a warp go through different paths in the code, the GPU has to run the entire warp multiple times, once for each code path.

To deal with this issue, called warp divergence, you want to arrange your threads so that the threads in a given warp go through as few different code paths as possible. When you have done that, you pretty much just have to bite the bullet and accept the loss in performance caused by any remaining warp divergence. In some cases, there might not be anything you can do to arrange your threads. If so, and if the different code paths are a big part of your kernel or overall workload, the task may not be a good fit for the GPU.
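One possible way to arrange the threads is to sort the work items by the value the switch tests, so that neighbouring threads (and therefore most warps) take the same case. The sketch below assumes Thrust is available and uses three placeholder cases; process_grouped and run_grouped are hypothetical names, and error checking is omitted.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// op[] is sorted, so consecutive threads mostly share a case.
// idx[i] is the original position of the i-th sorted work item.
__global__ void process_grouped(const int* op, const int* idx, float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = idx[i];
    switch (op[i]) {  // placeholder cases standing in for the real ones
        case 0:  data[j] += 1.0f;    break;
        case 1:  data[j] *= 2.0f;    break;
        default: data[j] = -data[j]; break;
    }
}

// Hypothetical host helper: groups work items by op code, then launches.
std::vector<float> run_grouped(const std::vector<int>& op,
                               const std::vector<float>& in) {
    assert(op.size() == in.size());
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    thrust::device_vector<int> d_op(op.begin(), op.end());
    thrust::device_vector<int> d_idx(n);
    thrust::device_vector<float> d_data(in.begin(), in.end());

    // idx = 0, 1, 2, ...; sorting by op code carries the indices along.
    thrust::sequence(d_idx.begin(), d_idx.end());
    thrust::sort_by_key(d_op.begin(), d_op.end(), d_idx.begin());

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    process_grouped<<<blocks, threads>>>(
        thrust::raw_pointer_cast(d_op.data()),
        thrust::raw_pointer_cast(d_idx.data()),
        thrust::raw_pointer_cast(d_data.data()), n);
    cudaDeviceSynchronize();

    // Results are written back to their original positions via idx.
    thrust::copy(d_data.begin(), d_data.end(), out.begin());
    return out;
}
```

After sorting, only the warps that straddle a boundary between two groups see more than one case. The trade-off is the cost of the sort itself and the scattered access through idx, so whether this pays off depends on how expensive the cases are; it is worth measuring rather than assuming.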

It doesn't matter how you implement the different code paths. if-else, switch, predication (in PTX or SASS), branch tables or anything else -- if it comes down to the threads in a warp running in different paths, you get a hit on performance.

It also doesn't matter how many threads go through each path, just the total number of different paths in the warp.
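To see how many different paths each warp would take for a given input, a small diagnostic kernel can count the distinct switch values per warp. This is only a sketch: count_paths_per_warp and count_paths are hypothetical names, op codes are assumed to be non-negative, the block is assumed to be one-dimensional with a size that is a multiple of 32, and the warp intrinsics used need CUDA 9 or later.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Writes, for each warp, the number of distinct op codes among its threads.
// Assumes a 1D block whose size is a multiple of warpSize and op codes >= 0.
__global__ void count_paths_per_warp(const int* op, int* paths, int n) {
    const unsigned FULL = 0xffffffffu;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;

    bool active = i < n;
    int key = active ? op[i] : -1;  // -1 marks threads past the end

    // A thread is "first" if no lower lane in its warp has the same key.
    bool first = active;
    for (int l = 0; l < warpSize; ++l) {
        int other = __shfl_sync(FULL, key, l);  // read lane l's key
        if (l < lane && other == key) first = false;
    }

    // One "first" thread per distinct key, so the popcount is the path count.
    unsigned leaders = __ballot_sync(FULL, first);
    if (lane == 0 && active) paths[i / warpSize] = __popc(leaders);
}

// Hypothetical host helper: returns one count per warp of 32 work items.
std::vector<int> count_paths(const std::vector<int>& op) {
    int n = static_cast<int>(op.size());
    int warps = (n + 31) / 32;
    std::vector<int> paths(warps);
    if (n == 0) return paths;

    int* d_op = nullptr;
    int* d_paths = nullptr;
    cudaMalloc(&d_op, n * sizeof(int));
    cudaMalloc(&d_paths, warps * sizeof(int));
    cudaMemcpy(d_op, op.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 128;  // multiple of 32, as the kernel assumes
    int blocks = (n + threads - 1) / threads;
    count_paths_per_warp<<<blocks, threads>>>(d_op, d_paths, n);

    cudaMemcpy(paths.data(), d_paths, warps * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_op);
    cudaFree(d_paths);
    return paths;
}
```

A warp where 31 threads take one case and a single thread takes another reports 2, the same as a warp split evenly between two cases, which matches the point above that the number of paths matters rather than how the threads are distributed among them.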

Here is another answer on this that goes into a bit more detail.

Source: https://stackoverflow.com/questions/11185718/dealing-with-large-switch-statements-in-cuda

Tags

parallel-processing

cuda

switch-statement

statements