R: Running a foreach %dopar% loop on an HPC MPI cluster


Question


I got access to an HPC cluster with an MPI partition.

My problem is that, no matter what I try, my code (which works fine on my PC) doesn't run on the HPC cluster. The code looks like this:

library(tm)
library(qdap)
library(snow)
library(doSNOW)
library(foreach)

cl <- makeCluster(30, type="MPI")
registerDoSNOW(cl)
np <- getDoParWorkers()
np
Base = "./Files1a/"
files = list.files(path=Base, pattern="\\.txt")

for (i in 1:length(files)) {
  # ...some definitions and variable generation...
  text <- foreach(k = 1:10, .combine='c') %do% {
    text = if (file.exists(paste("./Files", k, "a/", files[i], sep=""))) paste(tolower(readLines(paste("./Files", k, "a/", files[i], sep=""))), collapse=" ") else ""
  }

  docs <- Corpus(VectorSource(text))

  for (k in 1:10) {
    ID[k] <- paste(files[i], k, sep="_")
  }
  data <- as.data.frame(docs)
  data[["docs"]] = ID
  rm(docs)
  data <- sentSplit(data, "text")

  frequency = NULL
  cs <- ceiling(length(POLKEY$x) / getDoParWorkers())
  opt <- list(chunkSize=cs)
  frequency <- foreach(j = 2:length(POLKEY$x), .options.mpi=opt, .combine='cbind') %dopar% ...
  write.csv(frequency, file=paste("./Result/output", i, ".csv", sep=""))
  rm(data, frequency)
}

When I run the batch job, the session gets killed when it hits the time limit. In addition, I receive the following message after the MPI cluster is initialized:

Loading required namespace: Rmpi
--------------------------------------------------------------------------
PMI2 initialized but returned bad values for size and rank.
This is symptomatic of either a failure to use the
"--mpi=pmi2" flag in SLURM, or a borked PMI2 installation.
If running under SLURM, try adding "-mpi=pmi2" to your
srun command line. If that doesn't work, or if you are
not running under SLURM, try removing or renaming the
pmi2.h header file so PMI2 support will not automatically
be built, reconfigure and build OMPI, and then try again
with only PMI1 support enabled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:         ...
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
    30 slaves are spawned successfully. 0 failed.

Unfortunately, it seems that the loop doesn't complete even a single iteration, as no output is returned.

For the sake of completeness, my batch file:

#!/bin/bash -l
#SBATCH --job-name MyR
#SBATCH --output MyR-%j.out
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=6
#SBATCH --mem=24gb
#SBATCH --time=00:30:00

MyRProgram="$HOME/R/hpc_test2.R"

cd $HOME/R

export R_LIBS_USER=$HOME/R/Libs2

# start R with my R program
module load R

time R --vanilla -f $MyRProgram

Does anybody have a suggestion for how to solve the problem? What am I doing wrong?

Thanks in advance for your help!


Answer 1:


Your script is an MPI application, so you need to execute it appropriately via Slurm. The Open MPI FAQ has a special section on how to do that:

https://www.open-mpi.org/faq/?category=slurm

The most important point is that your script shouldn't execute R directly, but should execute it via the mpirun command, using something like:

mpirun -np 1 R --vanilla -f $MyRProgram

My guess is that the "PMI2" error is caused by not executing R via mpirun. I don't think the "fork" message indicates a real problem; it happens to me at times as well, presumably because R calls "fork" when initializing, but it has never caused a problem for me. I'm not sure why I only get that message occasionally.

Note that it is very important to tell mpirun to launch only one process, since the remaining workers will be spawned by makeCluster; that is what the -np 1 option is for. If Open MPI was properly built with Slurm support, it should know where to launch those spawned processes, but if you don't use -np 1, all 30 processes launched via mpirun will each spawn 30 more, causing a huge mess.
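For concreteness, here is a minimal sketch of how the batch file from the question could be adapted, keeping the same module name, library path, and resource requests (all copied from the question and possibly needing adjustment for your cluster); only the final line changes so that R is launched through mpirun:

#!/bin/bash -l
#SBATCH --job-name MyR
#SBATCH --output MyR-%j.out
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=6
#SBATCH --mem=24gb
#SBATCH --time=00:30:00

MyRProgram="$HOME/R/hpc_test2.R"

cd $HOME/R

export R_LIBS_USER=$HOME/R/Libs2

module load R

# Launch a single R process via mpirun; the remaining MPI workers
# are spawned from inside R by makeCluster().
time mpirun -np 1 R --vanilla -f $MyRProgram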

Finally, I think you should tell makeCluster to spawn only 29 processes to avoid running a total of 31 MPI processes. Depending on your network configuration, even that much oversubscription can cause problems.

I would create the cluster object as follows:

library(snow)
library(Rmpi)
cl <- makeCluster(mpi.universe.size() - 1, type="MPI")

That's safer and makes it easier to keep your R script and job script in sync with each other.



Source: https://stackoverflow.com/questions/34120644/r-running-foreach-dopar-loop-on-hpc-mpicluster
