I'm currently optimizing parts of my code and am therefore doing some benchmarking. I have NxN matrices A and T and want to multiply them elementwise and save the result in A again, i.e. A = A*T. As this code is not parallelizable in that form, I expanded the assignment into
!$OMP PARALLEL DO
do j = 1, N
    do i = 1, N
        A(i,j) = T(i,j) * A(i,j)
    end do
end do
!$OMP END PARALLEL DO
(Full minimal working example at http://pastebin.com/RGpwp2KZ.)
The strange thing is that, regardless of the number of threads (between 1 and 4), the execution time stays more or less the same (±10%), while the CPU time increases with the number of threads. That made me think that all the threads do the same work (because I made a mistake regarding OpenMP) and therefore take the same time.
But on another computer (which has 96 CPU cores available) the program behaves as expected: with an increasing number of threads the execution time decreases. Surprisingly, the CPU time decreases as well (up to ~10 threads, after which it rises again).
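For reference, this is roughly how I obtain the two timings (a minimal sketch, not the actual MWE from the pastebin link; it uses the standard cpu_time intrinsic for the CPU time and omp_get_wtime for the wall-clock time, compiled with gfortran -fopenmp and e.g. OMP_NUM_THREADS to set the thread count):

program timing_sketch
    use omp_lib
    implicit none
    integer, parameter :: N = 729
    complex(kind(1.0d0)) :: A(N,N), T(N,N)
    real(kind(1.0d0)) :: tc0, tc1, tw0, tw1
    integer :: i, j

    A = (1.0d0, 1.0d0)
    T = (0.5d0, 2.0d0)

    call cpu_time(tc0)       ! CPU time; with gfortran this typically sums over all threads
    tw0 = omp_get_wtime()    ! wall-clock ("execution") time

    !$OMP PARALLEL DO
    do j = 1, N
        do i = 1, N
            A(i,j) = T(i,j) * A(i,j)
        end do
    end do
    !$OMP END PARALLEL DO

    call cpu_time(tc1)
    tw1 = omp_get_wtime()

    print '(A,F12.6)', 'CPU  time (s): ', tc1 - tc0
    print '(A,F12.6)', 'wall time (s): ', tw1 - tw0
end program timing_sketch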
It might be that different versions of OpenMP or gfortran are installed on the two machines. If that could be the cause, it would be great if you could tell me how to find that out.
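What I can think of so far: gfortran --version prints the compiler version, and something like the following sketch should print which OpenMP specification the compiler implements (a sketch only, assuming a capital .F90 extension so the preprocessor runs and defines the standard _OPENMP macro):

! omp_version.F90 -- compile with: gfortran -fopenmp omp_version.F90
program omp_version
    use omp_lib
    implicit none
    ! _OPENMP expands to the release date (yyyymm) of the supported OpenMP specification
    print '(A,I0)', 'OpenMP spec date: ', _OPENMP
    print '(A,I0)', 'max threads     : ', omp_get_max_threads()
end program omp_version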
You could in theory make Fortran array operations parallel by using the Fortran-specific OpenMP WORKSHARE directive:
!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)
!$OMP END PARALLEL WORKSHARE
Note that though this is quite standard OpenMP code, some compilers, most notably the Intel Fortran Compiler (ifort), implement the WORKSHARE construct simply by means of the SINGLE construct, therefore giving no parallel speed-up whatsoever. On the other hand, gfortran converts this code fragment into an implicit PARALLEL DO loop. Note that gfortran won't parallelise the standard array notation A = T * A inside the worksharing construct unless it is written explicitly as A(:,:) = T(:,:) * A(:,:).
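In other words, with gfortran the difference boils down to the following two spellings (a side-by-side sketch; only the second one is turned into a parallel loop, as described above):

!$OMP PARALLEL WORKSHARE
A = T * A                  ! standard array notation: not parallelised by gfortran
!$OMP END PARALLEL WORKSHARE

!$OMP PARALLEL WORKSHARE
A(:,:) = T(:,:) * A(:,:)   ! explicit array sections: becomes an implicit PARALLEL DO
!$OMP END PARALLEL WORKSHARE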
Now about the performance and the lack of speed-up. Each column of your A and T matrices occupies (2 * 8) * 729 = 11664 bytes. One matrix occupies 8.1 MiB and the two matrices together occupy 16.2 MiB. This probably exceeds the last-level cache size of your CPU. Also, the multiplication code has very low compute intensity: it fetches 32 bytes of memory data per iteration and performs one complex multiplication in 6 FLOPs (4 real multiplications, 1 addition and 1 subtraction), then stores 16 bytes back to memory, which results in (6 FLOP)/(48 bytes) = 1/8 FLOP/byte. If the memory is considered to be full duplex, i.e. it supports writing while being read, then the intensity goes up to (6 FLOP)/(32 bytes) = 3/16 FLOP/byte.

It follows that the problem is memory bound and even a single CPU core might be able to saturate all the available memory bandwidth. For example, a typical x86 core can retire two floating-point operations per cycle and, if run at 2 GHz, it could deliver 4 GFLOP/s of scalar math. To keep such a core busy running your multiplication loop, the main memory would have to provide (4 GFLOP/s) * (16/3 byte/FLOP) = 21.3 GiB/s. This more or less exceeds the real memory bandwidth of current-generation x86 CPUs. And this is only for a single core with non-vectorised code. Adding more cores and threads would not increase the performance, since the memory simply cannot deliver data fast enough to keep the cores busy; rather, the performance will suffer because having more threads adds more overhead. When run on a multi-socket system like the one with 96 cores, the program gets access to more last-level cache and to higher main memory bandwidth (assuming a NUMA system with a separate memory controller in each CPU socket), so the performance increases, but only because there are more sockets, not because there are more cores.
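If it helps, the arithmetic above can be re-derived with a tiny back-of-the-envelope program (nothing new, just the same numbers spelled out; the 4 GFLOP/s figure is the assumed scalar peak from above):

program intensity
    implicit none
    integer, parameter :: N = 729
    real(kind(1.0d0)) :: flop, bytes_rw, bytes_fd

    flop     = 6.0d0     ! one complex multiply: 4 real multiplications + 1 add + 1 sub
    bytes_rw = 48.0d0    ! 32 bytes loaded + 16 bytes stored per iteration
    bytes_fd = 32.0d0    ! full-duplex memory: stores overlap with loads

    print '(A,I0)',   'bytes per matrix column   : ', 16*N                     ! 11664
    print '(A,F6.2)', 'MiB per matrix            : ', 16.0d0*N*N/2.0d0**20     ! ~8.1
    print '(A,F6.4)', 'FLOP/byte (read + write)  : ', flop/bytes_rw            ! 0.1250
    print '(A,F6.4)', 'FLOP/byte (full duplex)   : ', flop/bytes_fd            ! 0.1875
    print '(A,F6.1)', 'GB/s needed at 4 GFLOP/s  : ', 4.0d0*bytes_fd/flop      ! ~21.3
end program intensity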
Source: https://stackoverflow.com/questions/17812003/parallelization-of-elementwise-matrix-multiplication