Question
How does OpenMP use atomic instructions inside the reduction clause? Does it rely on atomic instructions at all? For instance, is the variable sum in the code below accumulated with an atomic '+' operator?
#include <omp.h>
#include <vector>
using namespace std;

int main()
{
    int m = 1000000;
    vector<int> v(m);
    for (int i = 0; i < m; i++)
        v[i] = i;
    int sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < m; i++)
        sum += v[i];
}
Answer 1:
How does OpenMP use atomic instructions inside reduction? Does it rely on atomic instructions at all?
Since the OpenMP standard does not specify how the reduction clause should (or should not) be implemented (e.g., based on atomic operations or not), its implementation may vary across concrete implementations of the OpenMP standard.
For instance, is the variable sum in the code below accumulated with an atomic + operator?
Nonetheless, from the OpenMP standard, one can read the following:
The reduction clause can be used to perform some forms of recurrence calculations (...) in parallel. For parallel and work-sharing constructs, a private copy of each list item is created, one for each implicit task, as if the private clause had been used. (...) The private copy is then initialized as specified above. At the end of the region for which the reduction clause was specified, the original list item is updated by combining its original value with the final value of each of the private copies, using the combiner of the specified reduction-identifier.
So based on that, one can infer that the variables used in the reduction clause will be private and, consequently, will not be updated atomically. Notwithstanding, even if that were not the case, it would be unlikely that a concrete implementation of the OpenMP standard would rely on an atomic operation for the statement sum += v[i];, since (in this case) that would not be the most efficient strategy. For more information on why that is the case, check the following SO threads:
- Why my parallel code using openMP atomic takes a longer time than serial code?;
- Why should I use a reduction rather than an atomic variable?.
Very informally, a more efficient approach than using atomic would be for each thread to have its own copy of the variable sum, and at the end of the parallel region each thread would save its copy into a resource shared among the threads -- now, depending on how the reduction is implemented, atomic operations might be used to update that shared resource. That resource would then be picked up by the master thread, which would reduce its contents and update the original sum variable accordingly.
More formally from OpenMP Reductions Under the Hood:
After having revisited parallel reductions in detail you might still have some open questions about how OpenMP actually transforms your sequential code into parallel code. In particular, you might wonder how OpenMP detects the portion in the body of the loop that performs the reduction. As an example, this or a similar code fragment can often be found in code samples:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x -= some_value;
You could also use - as reduction operator (which is actually redundant to +). But how does OpenMP isolate the update step x -= some_value? The discomforting answer is that OpenMP does not detect the update at all! The compiler treats the body of the for-loop like this:
#pragma omp parallel for reduction(+:x)
for (int i = 0; i < n; i++)
    x = some_expression_involving_x_or_not(x);
As a result, the modification of x could also be hidden behind an opaque function call. This is a comprehensible decision from the point of view of a compiler developer. Unfortunately, this means that you have to ensure that all updates of x are compatible with the operation defined in the reduction clause.
The overall execution flow of a reduction can be summarized as follows:
- Spawn a team of threads and determine the set of iterations that each thread j has to perform.
- Each thread declares a privatized variant of the reduction variable x initialized with the neutral element e of the corresponding monoid.
- All threads perform their iterations, no matter whether or how they involve an update of the privatized variable.
- The result is computed as sequential reduction over the (local) partial results and the global variable x. Finally, the result is written back to x.
Answer 2:
It is sometimes useful to check the generated assembly. For instance, GCC generated the following instructions for me (live demo):
...
add rcx, rax
xor eax, eax
lea rcx, [rsi+4+rcx*4]
.L4:
add eax, DWORD PTR [rdx]
add rdx, 4
cmp rdx, rcx
jne .L4
mov ecx, eax
.L3:
lock add DWORD PTR [rbx+12], ecx
...
These instructions are executed by all the threads inside the parallel region. The .L4: label marks the part of the loop performed by each particular thread. The result of the thread-local partial sum is accumulated into eax
(by add eax, DWORD PTR [rdx]
), which is zeroed out first (xor eax, eax
). Finally, the result is moved to ecx
and from ecx
to the memory location that represents the sum
variable.
Note that the increment of this memory location is atomic due to the lock
prefix of the add
instruction.
With GCC, the answer to your question is thus YES — it does use atomic increment, but each thread does this only once at the end. (In theory, it would be possible to use the atomic increment of the shared result in each iteration, but this would be terribly inefficient. Even with disabled optimizations, GCC does not do that.)
Answer 3:
There may be atomics hidden in the reduction clause, but probably not where you expect them to be.
First, sum += v[i];
is not implemented using atomic operations because it doesn't need to be. The OpenMP specification is pretty clear that reduction variables are nothing more than private variables with the following additional semantics:
- unlike regular private variables, reduction variables are initialised to the zero value of the reduction operator; for +, that value is 0;
- the individual private values of the reduction variable get combined (reduced) into the value of the original variable at the end of the OpenMP construct that the reduction is bound to.
It is that second part where atomic operations may be used by the implementation. As explained by @dreamcrash, the OpenMP specification does not prescribe how (or even when) exactly the reduction of the individual values happens. In any case, what reduction
does is equivalent to transforming
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < m; i++)
    sum += v[i];
into
int sum = 0;
#pragma omp parallel
{
    int sum_priv = 0;
    #pragma omp for
    for (int i = 0; i < m; i++)
        sum_priv += v[i];
    // BEGIN actual reduction
    #pragma omp atomic update
    sum += sum_priv;
    // END actual reduction
}
The marked part is where the actual reduction happens, i.e., where the sum variable gets updated with the values of the individual private copies. It could be implemented using atomic as shown, but there are more efficient ways, e.g., pairwise tree reductions, and the reduction could even be a function call.
Source: https://stackoverflow.com/questions/65406478/how-does-openmp-use-the-atomic-instruction-inside-reduction-clause