Given n partial sums, it's possible to sum them all in log2(n) parallel steps. For example, assume there are eight threads with eight partial sums:
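With s0 ... s7 denoting the eight partial sums, one possible pairing (which matches what the recursive code below ends up doing) takes log2(8) = 3 steps:

step 1: s0 += s1, s2 += s3, s4 += s5, s6 += s7
step 2: s0 += s2, s4 += s6
step 3: s0 += s4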
Actually, it is quite simple to implement that cleanly with tasks using a recursive divide-and-conquer approach. This is almost textbook code.
/* element-wise merge: add partial histogram p2 into p1 */
void operation(int* p1, int* p2, size_t bins)
{
    for (size_t i = 0; i < bins; i++)
        p1[i] += p2[i];
}
void reduce(int** arrs, size_t bins, int begin, int end)
{
    assert(begin < end);
    if (end - begin == 1) {
        return;
    }
    int pivot = (begin + end) / 2;
    /* Moving the termination condition here will avoid very short tasks,
     * but make the code less nice. */
    #pragma omp task
    reduce(arrs, bins, begin, pivot);
    #pragma omp task
    reduce(arrs, bins, pivot, end);
    #pragma omp taskwait
    /* now begin and pivot contain the partial sums. */
    operation(arrs[begin], arrs[pivot], bins);
}
/* call this within a parallel region */
#pragma omp single
reduce(at, bins, 0, n);
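To make that snippet self-contained, here is a minimal driver sketch. This is my own scaffolding, not code from the question: it assumes one partial histogram per potential thread, and since the arrays are zero-initialized it still works if the parallel region gets fewer threads than n.

#include <assert.h>
#include <omp.h>
#include <stdlib.h>

/* operation() and reduce() as defined above */

int main(void)
{
    size_t bins = 10000;
    int n = omp_get_max_threads();           /* one partial histogram per potential thread */
    int** at = malloc(n * sizeof(int*));
    for (int t = 0; t < n; t++)
        at[t] = calloc(bins, sizeof(int));   /* zeroed, so unused slots are harmless */

    #pragma omp parallel
    {
        /* ... each thread fills its private histogram at[omp_get_thread_num()] ... */

        #pragma omp barrier                  /* all partial histograms must be complete */
        #pragma omp single
        reduce(at, bins, 0, n);
        /* implicit barrier at the end of single: at[0] now holds the combined result */
    }

    /* use at[0], then free at[t] and at */
    return 0;
}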
As far as I can tell, there are no unnecessary synchronizations and no odd polling on critical sections. It also works naturally when the data size differs from the number of ranks. I find it very clean and easy to understand, so I do indeed think this is better than both of your solutions.
But let's look at how it performs in practice*. For that we can use Score-P and Vampir:
*bins=10000, so the reduction actually takes a little bit of time. Executed on a 24-core Haswell system w/o turbo, with gcc 4.8.4, -O3. I added some buffer around the actual execution to hide initialization/post-processing.
The picture reveals what is happening on each thread within the application along a horizontal time axis. The three implementations, from top to bottom:
1. omp for loop
2. omp critical kind of tasking
3. omp task
This shows nicely how the specific implementations actually execute. Now it seems that the for loop is actually the fastest, despite the unnecessary synchronizations. But there are still a number of flaws in this performance analysis. For example, I didn't pin the threads. In practice NUMA (non-uniform memory access) matters a lot: does the core have the data in its own cache / in the memory of its own socket? This is where the task solution becomes non-deterministic. The very significant variance among repetitions is not considered in the simple comparison.
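As an aside, pinning could be controlled with the standard OpenMP affinity environment variables (OpenMP 4.0), e.g. something like the line below. This is not what was used for the screenshots above, and the binary name is made up:

OMP_PLACES=cores OMP_PROC_BIND=close ./histogram_bench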
If the runtime of the reduction operation becomes variable, the task solution will do better than the synchronized for loop.
The critical solution has an interesting aspect: the passive threads are not simply idle while they wait, so they are more likely to consume CPU resources. This can be bad for performance, e.g. with turbo mode, because busy cores reduce the frequency headroom of the core doing the useful work.
Remember that the task solution has more optimization potential, namely avoiding spawning tasks that return immediately; see the sketch below. How these solutions perform also depends heavily on the specific OpenMP runtime; Intel's runtime seems to do much worse for tasks.
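A minimal sketch of what that could look like, following the comment in the reduce code above (reduce_cutoff is a hypothetical name, it must be called with end - begin >= 2, and it has not been benchmarked here):

void reduce_cutoff(int** arrs, size_t bins, int begin, int end)
{
    assert(end - begin >= 2);
    int pivot = (begin + end) / 2;
    /* only spawn a task if the subrange still needs merging */
    if (pivot - begin > 1) {
        #pragma omp task
        reduce_cutoff(arrs, bins, begin, pivot);
    }
    if (end - pivot > 1) {
        #pragma omp task
        reduce_cutoff(arrs, bins, pivot, end);
    }
    #pragma omp taskwait
    operation(arrs[begin], arrs[pivot], bins);
}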
My recommendation is:
Here is the timeline for the linear proccess_data_v1 from this question.
So I thought about OpenMP reduction. The tricky part seems to be getting the data from the at array inside the loop without a copy. I initialize the worker array with NULL and simply move the pointer the first time:
void meta_op(int** pp1, int* p2, size_t bins)
{
    if (*pp1 == NULL) {
        *pp1 = p2;
        return;
    }
    operation(*pp1, p2, bins);
}

// ...
// declare before parallel region as global
int* awork = NULL;

#pragma omp declare reduction(merge : int* : meta_op(&omp_out, omp_in, 100000)) initializer (omp_priv=NULL)

#pragma omp for reduction(merge : awork)
for (int t = 0; t < n; t++) {
    meta_op(&awork, at[t], bins);
}
Surprisingly, this doesn't look too good:
Top is icc 16.0.2, bottom is gcc 5.3.0, both with -O3.
Both seem to serialize the reduction. I tried to look into gcc / libgomp, but it's not immediately apparent to me what is happening. From intermediate code / disassembly, they seem to wrap the final merge in GOMP_atomic_start/end - and that seems to be a global mutex. Similarly, icc wraps the call to the operation in a kmpc_critical. I suppose there wasn't much optimization effort put into costly custom reduction operations; a traditional reduction can be done with a hardware-supported atomic operation.
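For contrast, a plain scalar reduction like the following sketch is the case those runtimes are tuned for, since the combine step can map to a cheap atomic or a per-thread accumulator (data and N are placeholder names):

long sum = 0;
#pragma omp parallel for reduction(+ : sum)
for (int i = 0; i < N; i++)
    sum += data[i];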
Notice how each individual operation is faster because the input is cached locally, but due to the serialization it is overall slower. Again, this is not a perfect comparison due to high variances, and the earlier screenshots were taken with a different gcc version. But the trend is clear, and I also have data on the cache effects.