I am moving a program parallelized with OpenMP to a cluster. The cluster uses Lava 1.0 as its scheduler and has 8 cores in each node. I used an MPI wrapper in the job script to
Taking into account the information that you have specified in the comments, your best option is to:
- request exclusive node access with -x (you already do that);
- request a single slot with -n 1 (you already do that);
- set OMP_NUM_THREADS to the number of cores per node (you already do that).

Your job script should look like this:
#BSUB -q queue_name
#BSUB -x
#BSUB -n 1
#BSUB -J n1p1o8
##BSUB -o outfile.email
#BSUB -e err
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
date
~/my_program ~/input.dat ~/output.out
date
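In case it helps, the script is submitted by feeding it to bsub on standard input so that the #BSUB directives get parsed (the file name here is just a placeholder):

bsub < n1p1o8.lsf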
OMP_PROC_BIND is part of the OpenMP 3.1 specification. If you are using a compiler that adheres to an older version of the standard, you should use the vendor-specific setting instead, e.g. GOMP_CPU_AFFINITY for GCC and KMP_AFFINITY for Intel compilers. Binding threads to cores prevents the operating system from moving threads between different processor cores, which speeds up execution, especially on NUMA systems (e.g. machines with more than one CPU socket and a separate memory controller in each socket), where data locality is very important.
If you'd like to run many copies of your program over different input files, then submit array jobs. With LSF (and I guess with Lava too) this is done by changing the job script:
#BSUB -q queue_name
#BSUB -x
#BSUB -n 1
#BSUB -J n1p1o8[1-20]
##BSUB -o outfile.email
#BSUB -e err_%I
export OMP_NUM_THREADS=8
export OMP_PROC_BIND=true
date
~/my_program ~/input_${LSB_JOBINDEX}.dat ~/output_${LSB_JOBINDEX}.out
date
This submits an array job of 20 subjobs (-J n1p1o8[1-20]). %I in -e is replaced by the job index, so you'll get a separate err file from each subjob. The LSB_JOBINDEX environment variable is set to the current job index, i.e. it will be 1 in the first subjob, 2 in the second, and so on.
My question about the memory usage of your program was not about how much memory it consumes. It was about how large the typical dataset processed in a single OpenMP loop is. If the dataset is not small enough to fit into the last-level cache of the CPU(s), then memory bandwidth comes into play. If your code does heavy local processing on each data item, then it might scale with the number of threads. If, on the other hand, it does simple and light processing, then the memory bus might get saturated even by a single thread, especially if the code is properly vectorised.

Usually this is measured by the so-called operational intensity in FLOPs/byte. It gives the amount of data processing that happens before the next data element is fetched from memory. High operational intensity means that a lot of number crunching happens in the CPU and data is only seldom transferred to/from memory. Such programs scale almost linearly with the number of threads, no matter what the memory bandwidth is. On the other hand, codes with very low operational intensity are memory-bound and leave the CPU underutilised.
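As a purely illustrative calculation: a simple update like y[i] = a*x[i] + y[i] on double-precision data performs 2 floating-point operations while moving 24 bytes (two 8-byte loads and one 8-byte store), i.e. roughly 0.08 FLOPs/byte, which is firmly on the memory-bound side; a compute-heavy kernel such as dense matrix-matrix multiplication, in contrast, can reach tens of FLOPs/byte and scales much better with additional threads.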
A program that is heavily memory-bound doesn't scale with the number of threads but with the available memory bandwidth. For example, on a newer Intel or AMD system, each CPU socket has its own memory controller and memory data path. On such systems the memory bandwidth is a multiple of the bandwidth of a single socket, e.g. a system with two sockets delivers twice the memory bandwidth of a single-socket system. In this case you might see an improvement in run time whenever both sockets are used, e.g. if you set OMP_NUM_THREADS equal to the total number of cores, or if you set OMP_NUM_THREADS to 2 and tell the runtime to put the two threads on different sockets (this is a plausible scenario when the threads are executing vectorised code and a single thread is able to saturate the local memory bus).
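A rough sketch of that second scenario, assuming a two-socket node with cores 0-7 on socket 0 and cores 8-15 on socket 1 (the numbering is an assumption; check the real topology, e.g. with numactl --hardware):

export OMP_NUM_THREADS=2
# OpenMP 4.0 runtimes: one thread per socket
export OMP_PLACES=sockets
export OMP_PROC_BIND=spread
# Older GCC runtimes: pinning to the first core of each socket has a similar effect
# export GOMP_CPU_AFFINITY="0 8"
~/my_program ~/input.dat ~/output.out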