Question
Assume we want to count something in an OpenMP loop. Compare the reduction
int counter = 0;
#pragma omp for reduction( + : counter )
for (...) {
...
counter++;
}
with the atomic increment
int counter = 0;
#pragma omp for
for (...) {
...
#pragma omp atomic
counter++;
}
The atomic access provides the result immediately, while a reduction only assumes its correct value at the end of the loop. For instance, reductions do not allow this:
int t = counter;
if (t % 1000 == 0) {
printf ("%dk iterations\n", t/1000);
}
thus providing less functionality.
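One detail about the atomic variant: a plain `int t = counter;` after the increment is a separate, non-atomic read. A minimal sketch of reading the value as part of the increment, assuming OpenMP 3.1 or later for `atomic capture`, might look like this. The function name `count_with_progress` and its parameters are hypothetical, chosen only for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: runs n loop iterations, counts them with an atomic
   counter, and reports progress every `step` increments.
   Returns the final value of the counter. */
int count_with_progress(int n, int step)
{
    int counter = 0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int t;

        /* Increment and read back in a single atomic operation, so every
           iteration observes a distinct value of the counter. */
        #pragma omp atomic capture
        t = ++counter;

        if (t % step == 0) {
            printf("%d iterations\n", t);
        }
    }
    return counter;
}
```

Calling `count_with_progress(5000, 1000)` returns 5000 and prints five progress lines; with several threads the order of those lines is not guaranteed.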
Why would I ever use a reduction instead of atomic access to a counter?
Answer 1:
Short answer:
Performance
Long Answer:
Because an atomic variable comes with a price, and this price is synchronization. To ensure that there are no race conditions, i.e. two threads modifying the same variable at the same moment, the threads must synchronize. This effectively means that you lose parallelism: the threads are serialized on every update.
Reduction, on the other hand, is a general operation that can be carried out in parallel using parallel reduction algorithms.
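To make the difference concrete, here is a rough sketch of what a `+` reduction does conceptually: each thread accumulates into a private variable without any synchronization, and the partial results are combined once per thread at the end. This is only an illustration, not how any particular OpenMP runtime implements the `reduction` clause; the function name `sum_with_partials` is hypothetical.

```c
/* Hypothetical helper: sums a[0..n-1] using one private partial sum
   per thread, combined at the end of the parallel region. */
long sum_with_partials(const int *a, int n)
{
    long sum = 0;

    #pragma omp parallel
    {
        long local = 0;              /* private to each thread */

        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            local += a[i];           /* no synchronization in the hot loop */
        }

        /* One synchronized update per thread instead of one per iteration. */
        #pragma omp atomic
        sum += local;
    }
    return sum;
}
```

With T threads and N iterations, this performs T synchronized updates rather than N, which is why the reduction scales while the per-iteration atomic does not.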
Addendum: Getting a sense of how a parallel reduction works
Imagine a scenario where you have 4 threads and you want to reduce an 8-element array A. You could do this in 3 steps:
- Step 0. Threads with index i<4 take care of the result of summing A[i]=A[i]+A[i+4].
- Step 1. Threads with index i<2 take care of the result of summing A[i]=A[i]+A[i+4/2].
- Step 2. Threads with index i<4/4 take care of the result of summing A[i]=A[i]+A[i+4/4].
At the end of this process you will have the result of your reduction in the first element of A, i.e. A[0].
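The steps above can be sketched in C as follows. This is a minimal illustration that assumes the array length is a power of two (at least 1) and that the array may be overwritten; the function name `tree_reduce` is hypothetical.

```c
/* Hypothetical helper: in-place tree reduction.
   Assumes n is a power of two and n >= 1. On return, a[0] holds the sum
   and the other elements of a contain intermediate partial sums. */
int tree_reduce(int *a, int n)
{
    /* For n = 8 the stride takes the values 4, 2, 1: steps 0, 1 and 2. */
    for (int stride = n / 2; stride > 0; stride /= 2) {
        /* Within one step every i writes a[i] and reads a[i + stride],
           so no two iterations touch the same element. The implicit
           barrier at the end of the parallel loop separates the steps. */
        #pragma omp parallel for
        for (int i = 0; i < stride; i++) {
            a[i] += a[i + stride];
        }
    }
    return a[0];
}
```

For `int A[8] = {1, 2, 3, 4, 5, 6, 7, 8};` the call `tree_reduce(A, 8)` returns 36 after three steps, instead of the seven sequential additions a serial sum would need.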
Answer 2:
Performance is the key point.
Consider the following program
#include <stdio.h>
#include <omp.h>
#define N 1000000
int a[N], sum;
int main(){
    double begin, end;
    begin=omp_get_wtime();
    for(int i =0; i<N; i++)
        sum+=a[i];
    end=omp_get_wtime();
    printf("serial %g\t",end-begin);
    begin=omp_get_wtime();
    # pragma omp parallel for
    for(int i =0; i<N; i++)
        # pragma omp atomic
        sum+=a[i];
    end=omp_get_wtime();
    printf("atomic %g\t",end-begin);
    begin=omp_get_wtime();
    # pragma omp parallel for reduction(+:sum)
    for(int i =0; i<N; i++)
        sum+=a[i];
    end=omp_get_wtime();
    printf("reduction %g\n",end-begin);
}
When compiled with gcc -O3 -fopenmp and executed, it gives:
serial 0.00491182 atomic 0.0786559 reduction 0.001103
So, approximately, "atomic" takes 16 times as long as "serial" and 70 times as long as "reduction".
The "reduction" version exploits the parallelism properly: on a 4-core computer, we can get a 3x to 6x speedup over "serial".
The "atomic" version, in contrast, takes about 16 times longer than "serial". As explained in the previous answer, serializing the memory accesses disables parallelism. On top of that, every memory access is done with an atomic operation. These operations require at least 20--50 cycles on modern computers and will dramatically slow down your program if used intensively.
Source: https://stackoverflow.com/questions/54186268/why-should-i-use-a-reduction-rather-than-an-atomic-variable