I have a general question about reductions with OpenMP that\'s bothered me for a while. My question is in regards to merging the partial sums in a reduction. It can either be
The OpenMP implementation will make a decision about the best way to do the reduction based on the implementor's knowledge of the specific characteristics of the hardware it's running on. On system with a small number of CPUs, it will probably do a linear reduction. On a system with hundreds or thousands of cores (e.g. GPU, Intel Phi) it will likely do a log(n) reduction.
The time spent in the reduction might not matter for very large problems, but for smaller problems it could be add a few percent to the total runtime. Your implementation might be just as fast in many cases, but I doubt it would ever be faster, so why not let OpenMP decide on the optimal reduction strategy?