Question
How does OpenMP use atomic instructions inside the reduction clause? Does it rely on atomic instructions at all? For instance, is the variable sum in the code below accumulated with an atomic '+' operator?
#include <omp.h>
#include <vector>
using namespace std;

int main()
{
    int m = 1000000;
    vector<int> v(m);
    for (int i = 0; i < m; i++)
        v[i] = i;
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < m; i++)
        sum += v[i];
}
Answer 1:
How does OpenMP use atomic instructions inside reduction? Does it rely on atomic instructions at all?
Since the OpenMP standard does not specify how the reduction clause should (or should not) be implemented (e.g., based on atomic operations or not), its implementation may vary across concrete implementations of the OpenMP standard.
For instance, is the variable sum in the code below accumulated with an atomic + operator?
Nonetheless, from the OpenMP standard, one can read the following:
The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.
So based on that, one can infer that the variables used in the reduction clause will be private and, consequently, will not be updated atomically. Notwithstanding, even if that were not the case, it would be unlikely that a concrete implementation of the OpenMP standard would rely on an atomic operation for the statement sum += v[i];, since (in this case) that would not be the most efficient strategy. For more information on why that is the case, check the following SO threads:
- Why my parallel code using openMP atomic takes a longer time than serial code?;
- Why should I use a reduction rather than an atomic variable?.
Very informally, a more efficient approach than using atomic would be for each thread to have its own copy of the variable sum, and at the end of the parallel region each thread would save its copy into a resource shared among the threads -- now, depending on how the reduction is implemented, atomic operations might be used to update that shared resource. That resource would then be picked up by the master thread, which would reduce its contents and update the original sum variable accordingly.
More formally from OpenMP Reductions Under the Hood:
After having revisited parallel reductions in detail you might still have some open questions about how OpenMP actually transforms your sequential code into parallel code. In particular, you might wonder how OpenMP detects the portion in the body of the loop that performs the reduction. As an example, this or a similar code fragment can often be found in code samples:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x -= some_value;
You could also use - as reduction operator (which is actually redundant to +). But how does OpenMP isolate the update step x -= some_value? The discomforting answer is that OpenMP does not detect the update at all! The compiler treats the body of the for-loop like this:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x = some_expression_involving_x_or_not(x);
As a result, the modification of x could also be hidden behind an opaque function call. This is a comprehensible decision from the point of view of a compiler developer. Unfortunately, this means that you have to ensure that all updates of x are compatible with the operation defined in the reduction clause.
The overall execution flow of a reduction can be summarized as follows:
- Spawn a team of threads and determine the set of iterations that each thread j has to perform.
- Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding monoid.
- All threads perform their iterations, no matter whether or how they involve an update of the privatized variable.
- The result is computed as sequential reduction over the (local) partial results and the global variable x. Finally, the result is written back to x.
Answer 2:
It is sometimes useful to check the generated assembly. For instance, GCC generated the following instructions for me (live demo):
...
add rcx, rax
xor eax, eax
lea rcx, [rsi+4+rcx*4]
.L4:
add eax, DWORD PTR [rdx]
add rdx, 4
cmp rdx, rcx
jne .L4
mov ecx, eax
.L3:
lock add DWORD PTR [rbx+12], ecx
...
These instructions are executed by all the threads inside the parallel region. The .L4: label marks the part of the loop performed by each particular thread. The result of the thread-local partial sum is accumulated into eax
(by add eax, DWORD PTR [rdx]
), which is zeroed out first (xor eax, eax
). Finally, the result is moved to ecx
and from ecx
to the memory location that represents the sum
variable.
Note that the increment of this memory location is atomic due to the lock
prefix of the add
instruction.
With GCC, the answer to your question is thus YES — it does use atomic increment, but each thread does this only once at the end. (In theory, it would be possible to use the atomic increment of the shared result in each iteration, but this would be terribly inefficient. Even with disabled optimizations, GCC does not do that.)
Answer 3:
There may be atomics hidden in the reduction clause, but probably not where you expect them to be.
First, sum += v[i];
is not implemented using atomic operations because it doesn't need to be. The OpenMP specification is pretty clear that reduction variables are nothing more than private variables with the following additional semantics:
- unlike regular private variables, reduction variables are initialised to the zero value of the reduction operator; for +, that value is 0;
- the individual private values of the reduction variable get combined (reduced) into the value of the original variable at the end of the OpenMP construct that the reduction is bound to.
It is that second part where atomic operations may be used by the implementation. As explained by @dreamcrash, the OpenMP specification does not prescribe how (or even when) exactly the reduction of the individual values happens. In any case, what reduction
does is equivalent to transforming
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < m; i++)
    sum += v[i];
into
int sum = 0;
#pragma omp parallel
{
    int sum_priv = 0;
    #pragma omp for
    for (int i = 0; i < m; i++)
        sum_priv += v[i];
    // BEGIN actual reduction
    #pragma omp atomic update
    sum += sum_priv;
    // END actual reduction
}
The marked part is where the actual reduction happens, i.e., where the sum variable gets updated with the values of the individual private copies. It could be implemented using atomic as shown, but there are more efficient ways, e.g., pairwise tree reductions, and the reduction could even be a function call.
Source: https://stackoverflow.com/questions/65406478/how-does-openmp-use-the-atomic-instruction-inside-reduction-clause