I am struggling to find the proper way to execute a hybrid OpenMP/MPI job with MPICH (hydra).
I am easily able to launch the processes and they do spawn threads, but the threads stay bound to the same core as their master thread, whatever -bind-to type I try.
If I explicitly set GOMP_CPU_AFFINITY to 0-15, all the threads get spread out, but only if I run one process per node. I don't want that; I want one process per socket.
Setting OMP_PROC_BIND=false
does not have a noticeable effect.
One example of the many different combinations I tried:
export OMP_NUM_THREADS=8
export OMP_PROC_BIND="false"
mpiexec.hydra -n 2 -ppn 2 -envall -bind-to numa ./a.out
What I get is all processes sitting on one of the cores 0-7 at 100%, and several threads on cores 8-15, but only one of them close to 100% (they are waiting on the first process).
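For reference, one way to check where the threads actually land (a diagnostic sketch; a.out is just the binary name from the command above) is to list each thread together with the processor it is currently running on:
ps -L -o pid,lwp,psr,pcpu,comm -p "$(pgrep -d, a.out)"
The psr column shows the core each thread is scheduled on, and pcpu its CPU usage.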
Since libgomp is missing the equivalent of the respect clause of Intel's KMP_AFFINITY, you could hack around it by providing a wrapper script that reads the list of allowed CPUs from /proc/PID/status (Linux-specific):
#!/bin/sh
# Read the list of CPUs this process is allowed to run on (set by the MPI
# launcher's binding) and hand it to libgomp as the thread affinity list.
GOMP_CPU_AFFINITY=$(grep ^Cpus_allowed_list /proc/self/status | grep -Eo '[0-9,-]+')
export GOMP_CPU_AFFINITY
# Replace the shell with the actual program, passing all arguments through.
exec "$@"
This should then work with -bind-to numa.
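For example, assuming the wrapper above is saved as omp_wrap.sh (a name chosen here just for illustration) and made executable, the original command would become:
mpiexec.hydra -n 2 -ppn 2 -bind-to numa ./omp_wrap.sh ./a.out
Hydra then confines each rank to one NUMA node, and the wrapper hands exactly that CPU list to libgomp.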
I do have a somewhat different solution for binding OpenMP threads to sockets / NUMA nodes when running a mixed MPI / OpenMP code, whenever the MPI library and the OpenMP runtime do not collaborate well by default. The idea is to use numactl and its binding properties. This even has the extra advantage of binding not only the threads to the socket, but also the memory, enforcing good memory locality and maximising the bandwidth.
To that end, I first disable any MPI and/or OpenMP binding (with the corresponding mpiexec option for the former, and by setting OMP_PROC_BIND to false for the latter). Then I use the following omp_bind.sh shell script:
#!/bin/bash
# Bind both the CPUs and the memory of this rank to a single NUMA node,
# alternating between node 0 and node 1 based on the rank number.
numactl --cpunodebind=$(( $PMI_ID % 2 )) --membind=$(( $PMI_ID % 2 )) "$@"
And I run my code this way:
OMP_PROC_BIND="false" OMP_NUM_THREADS=8 mpiexec -ppn 2 -bind-to-none omp_bind.sh a.out args
Depending on the number of sockets on the machine, the 2 would need to be adjusted in the script. Likewise, the PMI_ID variable depends on the version of mpiexec used; I sometimes saw MPI_RANK, PMI_RANK, etc.
But anyway, I always found a way of getting it to work, and the memory binding comes in very handy sometimes, especially to avoid the potential pitfall of the I/O buffers eating up all the memory on the first NUMA node, which would lead to the process running on the first socket allocating its memory on the second NUMA node.
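For instance, a variant of omp_bind.sh along these lines should detect both automatically (an untested sketch; adjust the rank variable to whatever your mpiexec actually sets):
#!/bin/bash
# Sketch: derive the number of NUMA nodes instead of hard-coding 2,
# and fall back through the per-rank variables mentioned above.
NNODES=$(numactl --hardware | awk '/^available:/ {print $2}')
RANK=${PMI_RANK:-${PMI_ID:-${MPI_RANK:-0}}}
NODE=$(( RANK % NNODES ))
exec numactl --cpunodebind=$NODE --membind=$NODE "$@"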
Source: https://stackoverflow.com/questions/33696673/executing-hybrid-openmp-mpi-jobs-in-mpich