Question
The snippet of my serial code is shown below.
Program main
  use omp_lib
  Implicit None
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !
  Do i = 1, 100000000
    a = a + Real(i)
  End Do
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !
  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main
The elapsed time:
By using the OpenMP directives do and atomic, I converted the serial code into parallel code. However, the parallel program is slower than the serial program, and I don't understand why. My parallel code is shown below:
Program main
  use omp_lib
  Implicit None
  Integer, Parameter :: n_threads = 8
  Integer :: i, my_id
  Real(8) :: t0, t1, t2, t3, a = 0.0d0
  !$ t0 = omp_get_wtime()
  Call CPU_time(t2)
  ! ------------------------------------------ !
  !$OMP Parallel Num_threads(n_threads) shared(a)
  !$OMP Do
  Do i = 1, 100000000
    !$OMP Atomic
    a = a + Real(i)
  End Do
  !$OMP End Do
  !$OMP End Parallel
  ! ------------------------------------------ !
  Call CPU_time(t3)
  !$ t1 = omp_get_wtime()
  ! ------------------------------------------ !
  Write (*,*) "a = ", a
  Write (*,*) "The wall time is ", t1-t0, "s"
  Write (*,*) "The CPU time is ", t3-t2, "s"
End Program main
The elapsed time:
So my question is: why does my parallel code using OpenMP atomic take longer than the serial code?
Answer 1:
You are applying an atomic operation to the same variable in every single loop iteration. Moreover, each update of that variable depends on the value left by the previous update, so updates coming from different threads have to be carried out one at a time and the loop is effectively serialized. Naturally, that comes with additional overhead (e.g., synchronization and the extra CPU cycles of an atomic update) compared with the sequential version. Furthermore, you are probably getting a lot of cache misses, because every update made by one thread invalidates the copy of a held in the other threads' caches.
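To illustrate that the cost comes from issuing one atomic update per iteration, here is a minimal sketch that accumulates into a thread-private partial sum and performs a single atomic update per thread. The names sum_partial and partial are hypothetical, and the sketch converts with real(i, real64) rather than Real(i), a change from the question's code made so that the result can be checked exactly against n(n+1)/2.

```fortran
program sum_partial
   use, intrinsic :: iso_fortran_env, only: real64
   implicit none
   integer, parameter :: n = 100000000
   integer :: i
   real(real64) :: a, partial, expected

   a = 0.0_real64

   !$omp parallel private(partial) shared(a)
   partial = 0.0_real64
   !$omp do
   do i = 1, n
      ! Thread-private accumulation: no synchronization inside the loop.
      partial = partial + real(i, real64)
   end do
   !$omp end do
   ! One atomic update per thread instead of one per iteration.
   !$omp atomic
   a = a + partial
   !$omp end parallel

   ! Every partial sum is an integer below 2**53, so the additions are exact
   ! in double precision and the result does not depend on the thread count.
   expected = (0.5_real64 * real(n, real64)) * real(n + 1, real64)
   write (*,*) "a        = ", a
   write (*,*) "expected = ", expected
   if (a /= expected) error stop "sum does not match n(n+1)/2"
end program sum_partial
```

With gfortran this can be built with the -fopenmp flag; without it the directives are treated as comments and the program runs serially with the same result.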
This is a typical case for a reduction on the variable a (i.e., !$omp parallel do reduction(+:a)) instead of an atomic operation. With the reduction, each thread has a private copy of the variable a, and at the end of the parallel region the threads combine their copies (using the + operator) into a single value that is stored in the variable a of the main thread.
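As a minimal sketch of the reduction version described above (not the asker's exact program): the loop is the same, but the atomic directive is replaced by a reduction(+:a) clause. The program name sum_reduction is hypothetical, the timing uses the intrinsic system_clock instead of omp_get_wtime so that the program also builds without OpenMP, and the conversion uses real(i, real64) instead of Real(i) so that the result can be checked against n(n+1)/2.

```fortran
program sum_reduction
   use, intrinsic :: iso_fortran_env, only: int64, real64
   implicit none
   integer, parameter :: n = 100000000
   integer :: i
   integer(int64) :: c0, c1, rate
   real(real64) :: a, expected

   a = 0.0_real64
   call system_clock(c0, rate)

   ! Each thread sums into its own private copy of a (initialized to 0 for +);
   ! the private copies are combined into the shared a once, after the loop.
   !$omp parallel do reduction(+:a)
   do i = 1, n
      a = a + real(i, real64)
   end do
   !$omp end parallel do

   call system_clock(c1)

   ! Closed form n(n+1)/2; exact in double precision for this n.
   expected = (0.5_real64 * real(n, real64)) * real(n + 1, real64)
   write (*,*) "a = ", a
   write (*,*) "The wall time is ", real(c1 - c0, real64) / real(rate, real64), "s"
   if (a /= expected) error stop "sum does not match n(n+1)/2"
end program sum_reduction
```

Built with, for example, gfortran -fopenmp, the number of threads can be set through the OMP_NUM_THREADS environment variable rather than a Num_threads clause.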
You can find a more detailed answer about the differences between atomic and reduction on this SO thread. That thread even includes code whose atomic version is, just like yours, much slower than its sequential counterpart (about 20x slower). That case is even worse than yours (20x vs. 10x).
Source: https://stackoverflow.com/questions/64823158/why-my-parallel-code-using-openmp-atomic-takes-a-longer-time-than-serial-code