Why should I use a reduction rather than an atomic variable?


Question


Assume we want to count something in an OpenMP loop. Compare the reduction

int counter = 0;
#pragma omp for reduction( + : counter )
for (...) {
    ...
    counter++;
}

with the atomic increment

int counter = 0;
#pragma omp for
for (...) {
    ...
    #pragma omp atomic
    counter++;
}

The atomic increment makes the up-to-date value available immediately, while a variable in a reduction only takes its correct value at the end of the loop. For instance, reductions do not allow this:

int t = counter;
if (t % 1000 == 0) {
    printf ("%dk iterations\n", t/1000);
}

thus providing less functionality.

Why would I ever use a reduction instead of atomic access to a counter?


Answer 1:


Short answer:

Performance

Long answer:

Because an atomic variable comes with a price, and that price is synchronization. To ensure there are no race conditions, i.e. two threads modifying the same variable at the same moment, the threads must synchronize, which effectively means you lose parallelism: the threads are serialized.

A reduction, on the other hand, is a general operation that can be carried out in parallel using a parallel reduction algorithm; see the literature on parallel reduction algorithms for more detail.
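
To see why this avoids per-increment synchronization, here is a minimal, hand-rolled sketch of roughly what reduction(+ : counter) amounts to: each thread accumulates into its own private copy and the copies are combined once at the end. This is only an illustration, not the code the compiler actually generates; count_events and its loop body are made up for the example.

#include <omp.h>

int count_events(int n) {
    int counter = 0;
    #pragma omp parallel
    {
        int local = 0;                /* each thread counts into its own private copy */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            /* ... work ... */
            local++;                  /* no synchronization needed here */
        }
        #pragma omp atomic            /* one synchronized update per thread, not per iteration */
        counter += local;
    }
    return counter;
}

The key point is that the synchronized update happens once per thread instead of once per iteration.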


Addendum: Getting a sense of how a parallel reduction works

Imagine a scenario where you have 4 threads and you want to reduce an 8-element array A. You could do this in 3 steps:

  • Step 0: threads with index i < 4 compute A[i] = A[i] + A[i+4].
  • Step 1: threads with index i < 2 compute A[i] = A[i] + A[i+2].
  • Step 2: the thread with index i < 1 computes A[i] = A[i] + A[i+1].

At the end of this process the result of the reduction is in the first element of A, i.e. A[0]; a code sketch of these steps follows below.
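
As a concrete illustration of those three steps, here is a minimal sketch assuming 4 threads and an 8-element array, as in the description above. The array values are made up for the example, and it assumes the runtime actually grants 4 threads.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int A[8] = {1, 2, 3, 4, 5, 6, 7, 8};    /* the sum is 36 */
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        /* stride halves at each step: 4, 2, 1 */
        for (int stride = 4; stride >= 1; stride /= 2) {
            if (i < stride)
                A[i] = A[i] + A[i + stride];
            #pragma omp barrier              /* wait for every thread before the next step */
        }
    }
    printf("%d\n", A[0]);                    /* the reduction result ends up in A[0] */
    return 0;
}

Each step halves the number of active threads, so the whole reduction takes a logarithmic number of steps instead of one synchronized update per element.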




Answer 2:


Performance is the key point.

Consider the following program

#include <stdio.h>
#include <omp.h>
#define N 1000000
int a[N], sum;   /* a is zero-initialized; only the timings matter here */

int main(){
  double begin, end;

  /* serial sum */
  begin=omp_get_wtime();
  for(int i=0; i<N; i++)
    sum+=a[i];
  end=omp_get_wtime();
  printf("serial %g\t",end-begin);

  /* parallel loop: shared sum updated with an atomic operation on every iteration */
  begin=omp_get_wtime();
# pragma omp parallel for
  for(int i=0; i<N; i++)
# pragma omp atomic
    sum+=a[i];
  end=omp_get_wtime();
  printf("atomic %g\t",end-begin);

  /* parallel loop with a reduction: per-thread partial sums, combined at the end */
  begin=omp_get_wtime();
# pragma omp parallel for reduction(+:sum)
  for(int i=0; i<N; i++)
    sum+=a[i];
  end=omp_get_wtime();
  printf("reduction %g\n",end-begin);
}

When executed (gcc -O3 -fopenmp), it gives:

serial 0.00491182 atomic 0.0786559 reduction 0.001103

So, roughly: atomic ≈ 20 × serial ≈ 80 × reduction.

The "reduction" version properly exploits the parallelism; on a 4-core computer it can give a 3-6x speed-up over "serial".

Now, "atomic" is 20 times longer than "serial". Not only, as explained in the previous answer, the serialization of memory accesses disables parallelism, but all memory accesses are done by atomic operations. These operations require at least 20--50 cycles on modern computers and will dramatically slow down your performances if used intensively.



Source: https://stackoverflow.com/questions/54186268/why-should-i-use-a-reduction-rather-than-an-atomic-variable
