How can I ensure that my Fortran FORALL construct is being parallelized?

If you use the Intel Fortran Compiler, you can use a command-line switch to turn on or increase the compiler's verbosity level for parallelization/vectorization reporting. During compilation/linking you will then be shown something like:

FORALL loop at line X in file Y has been vectorized

I admit that it has been a few years since I last used it, so the compiler message may look somewhat different now, but that's the basic idea.
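As a rough sketch of what the invocation looks like (the exact flag spellings are an assumption on my part and vary between compiler versions, so check the ifort documentation), you would ask for auto-parallelization plus an optimization report with something like:

ifort -parallel -qopt-report=2 blah.f90 -o blah

The report should then state, per loop, whether it was vectorized or auto-parallelized and, if not, why.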

FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left hand side (LHS). This has to be done no matter how complex the operations on the RHS, including cases where the RHS and the LHS overlap.

Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
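As a minimal, self-contained sketch of those semantics (the array names and values are made up for illustration), the FORALL below reads a(i-1) from the original contents of a, which is exactly why a straightforward implementation stages the RHS in a temporary; the DO loops underneath spell out that staging explicitly:

! Sketch only: demonstrates FORALL's evaluate-the-whole-RHS-first semantics.
program forall_semantics
  implicit none
  integer, parameter :: n = 5
  real :: a(n), b(n), tmp(n-1)
  integer :: i

  a = [1.0, 2.0, 3.0, 4.0, 5.0]
  b = a

  ! FORALL: every RHS is evaluated with the original contents of a
  ! before any element of a is overwritten.
  forall (i = 2:n) a(i) = a(i-1) + a(i)

  ! What a simple implementation does behind the scenes:
  ! evaluate the RHS into a temporary, then copy it into the LHS.
  do i = 2, n
     tmp(i-1) = b(i-1) + b(i)
  end do
  do i = 2, n
     b(i) = tmp(i-1)
  end do

  print *, a   ! 1.0 3.0 5.0 7.0 9.0
  print *, b   ! identical result via the explicit temporary
end program forall_semantics

When the LHS and RHS do not overlap, no temporary is needed and the FORALL is equivalent to plain nested DO loops, which is the case compilers already optimize well.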

With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL with the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:

!$omp parallel workshare
FORALL (i = ..., j = ...)
    <assignment>
END FORALL
!$omp end parallel workshare

gfortran -fopenmp blah.f90 -o blah

Note that a compliant OpenMP implementation is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate the RHS as though it were enclosed in an OpenMP "single" directive (at least older versions of gfortran did exactly that). Note also that "workshare" likely will not eliminate the temporary allocated for the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.

The best way is to measure the wall-clock time of the calculation, with and without the parallel code. If the wall-clock time decreases, then your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the wall-clock time. The intrinsic cpu_time will give you the CPU time, which might go up when code is run multi-threaded due to overhead.
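A minimal sketch of that measurement (the FORALL here is just a stand-in for your own computation):

program time_forall
  implicit none
  integer, parameter :: n = 2000
  real, allocatable :: a(:,:), b(:,:)
  integer :: t_start, t_end, t_rate
  real :: c_start, c_end
  integer :: i, j

  allocate(a(n,n), b(n,n))
  call random_number(b)

  call system_clock(t_start, t_rate)   ! wall-clock time
  call cpu_time(c_start)               ! CPU time; may exceed wall-clock time when multi-threaded

  forall (i = 1:n, j = 1:n) a(i,j) = 2.0 * b(i,j) + 1.0

  call cpu_time(c_end)
  call system_clock(t_end)

  print *, 'wall-clock seconds:', real(t_end - t_start) / real(t_rate)
  print *, 'cpu seconds:       ', c_end - c_start
end program time_forall

If the parallel build helps, the wall-clock figure drops, while the CPU figure typically stays the same or rises slightly because of threading overhead.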

The lore is that FORALL is not as useful as was hoped when it was introduced into the language -- that it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.

Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?

To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.
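For example (the flag spellings are from memory and should be checked against your compiler's manual), you can build the same source twice and compare the wall-clock times of the two binaries:

ifort blah.f90 -o blah_serial
ifort -parallel blah.f90 -o blah_autopar

gfortran blah.f90 -o blah_serial
gfortran -fopenmp blah.f90 -o blah_omp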
