I am using the R package foreach()
with %dopar%
to do long (~days) calculations in parallel. I would like the ability to stop the entire set of cal
Instead of trying to break out of a loop, I write a small file to the disk when I've reached my terminal loop, then have all remaining iterations simply skip based on the existence of that file.
Checking if a file exists costs us less than a milisecond of computing time.
# 1.4 seconds to check if a file exists a million times
system.time(lapply(1:1e6, function(x) file.exists("checker.txt")))
user system elapsed
1.204 0.233 1.437
This is great when you don't have a fixed number of iterations or your process can finish before all of the iterations are complete (like a convergence, for example)
library(foreach)
alist <- foreach(i = 1:5000) %dopar% {
if(file.exists("checker.txt")) {
return(NULL)
} else {
if(i = 20) {
write("", "checker.txt") # write an empty file
}
return(i)
}
}
file.remove("checker.txt")
The great thing about this is that even if your list is extremely long, if you just unlist() you only get the values.
> length(alist)
[1] 5000
> unlist(res)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Don't bother trying to break, instead, just "skip the rest"!
Steve Weston's original answer essentially answered this. But here is a slightly modified version of his answer which also preserves two additional features in the way I need them: (1) random number generation; (2) printing run-time diagnostics.
suppressMessages(library(doMPI))
comb <- function(x, y) {
if(inherits(y, 'error')) {
stop(y)
}
rbind(x, y) # forces the row names to be 'y'
}
myfunc <- function() {
writeLines(text="foreach log", con="log.txt")
foreach(i=1:12, .errorhandling='pass', .combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
set.seed(100)
sink("log.txt", append=TRUE)
if(i==6) {
stop('testing abort')
}
Sys.sleep(10)
cat("Completed task", i, "\n")
sink(NULL)
rnorm(5,mean=i)
}
}
myerr <- function(e) {
cat(sprintf('Caught error: %s\n', conditionMessage(e)))
mpi.abort(cl$comm)
}
cl <- startMPIcluster(4)
registerDoMPI(cl)
r <- tryCatch(myfunc(), error=myerr)
closeCluster(cl)
When this file is sourced, it exits as intended with an error message
> source("exp2.R")
4 slaves are spawned successfully. 0 failed.
Caught error: testing abort
[ganges.local:16325] MPI_ABORT invoked on rank 0 in communicator with errorcode 0
The 'log.txt' files provides correct diagnostics up to the point of error, and then provides additional error information. Crucially, the execution of all tasks is halted as soon as the stop() in the foreach loop is encountered: it does not wait until the entire foreach loop has completed. Thus I only see the 'Completed task' message up to i=4. (Note that if Sys.sleep() is shorter, then later tasks may be started before the mpi.abort() is processed.)
If I change the stop condition to be "i==100", then the stop and hence the error is not triggered. The code successfully exists without an error message, and r is a 2D array with dimensions 12*5.
Incidentally, it seems that I don't actually need .inorder=FALSE (I think that just gives me a small speed increase in the event that an error is found).
The answer I got from REvolution Technical support: "no--foreach doesn't currently have a way to stop all parallel computations on an error to any one".
I am not having much luck getting foreach
to do what I want, so here is a solution using the parallel
package which seems to do what I want. I use the intermediate
option in mcparallel()
to pass results from my function, do.task()
, immediately to the function check.res()
. If do.task()
throws an error, then this is used in check.res()
to trigger calling tools::pskill
to explicitly kill all workers. This might not be very elegant, but it works in the sense that it causes an instant stop of all worked. Furthermore, I can simply inherit all the variables I need for the processing in do.task()
from the present environment. (In reality do.task()
is a much more complex function requiring many variables to be passed in.)
library(parallel)
# do.task() and check.res() inherit some variables from enclosing environment
do.task <- function(x) {
cat("Starting task", x, "\n")
Sys.sleep(5*x)
if(x==stopat) {
stop("Error in job", x) # thrown to mccollect() which sends it to check.res()
}
cat(" Completed task", x, "\n")
return(10*x)
}
check.res <- function(r) { # r is list of results so far
cat("Called check.res\n")
sendKill <- FALSE
for(j in 1:Njob) { # check whether need to kill
if(inherits(r[[j]], 'try-error')) {
sendKill <- TRUE
}
}
if(sendKill) { # then kill all
for(j in 1:Njob) {
cat("Killing job", job[[j]]$pid, "\n")
tools::pskill(job[[j]]$pid) # mckill not accessible
}
}
}
Tstart <- Sys.time()
stopat <- 3
Njob <- 4
job <- vector("list", length=Njob)
for(j in 1:Njob) {
job[[j]]<- mcparallel(do.task(j))
}
res <- mccollect(job, intermediate=check.res) # res is in order 1:Njob, regardless of how long jobs took
cat("Collected\n")
Tstop <- Sys.time()
print(difftime(Tstop,Tstart))
for(j in 1:Njob) {
if(inherits(res[[j]], 'try-error')) {
stop("Parallel part encountered an error")
}
}
This gives the following screen dump and results for variable res
> source("exp5.R")
Starting task 1
Starting task 2
Starting task 3
Starting task 4
Completed task 1
Called check.res
Called check.res
Completed task 2
Called check.res
Called check.res
Called check.res
Killing job 21423
Killing job 21424
Killing job 21425
Killing job 21426
Called check.res
Killing job 21423
Killing job 21424
Killing job 21425
Killing job 21426
Called check.res
Killing job 21423
Killing job 21424
Killing job 21425
Killing job 21426
Collected
Time difference of 15.03558 secs
Error in eval(expr, envir, enclos) : Parallel part encountered an error
> res
$`21423`
[1] 10
$`21424`
[1] 20
$`21425`
[1] "Error in do.task(j) : Error in job3\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in do.task(j): Error in job3>
$`21426`
NULL
It sounds like you want an impatient version of the "stop" error handling. You could implement that by writing a custom combine function, and arranging for foreach
to call it as soon as each result is returned. To do that you need to:
combine
on-the-fly, like doMPI
or doRedis
.multicombine
.inorder
to FALSE
.init
to something (like NULL
)Here's an example that does that:
library(foreach)
parfun <- function(errval, n) {
abortable <- function(errfun) {
comb <- function(x, y) {
if (inherits(y, 'error')) {
warning('This will leave your parallel backend in an inconsistent state')
errfun(y)
}
c(x, y)
}
foreach(i=seq_len(n), .errorhandling='pass', .export='errval',
.combine='comb', .inorder=FALSE, .init=NULL) %dopar% {
if (i == errval)
stop('testing abort')
Sys.sleep(10)
i
}
}
callCC(abortable)
}
Note that I also set the error handling to "pass" so foreach
will call the combine function with an error object. The callCC
function is used to return from the foreach
loop regardless of the error handling used within foreach
and the backend. In this case, callCC
will call the abortable
function, passing it a function object that is used force callCC
to immediately return. By calling that function from the combine function we can escape from the foreach
loop when we detect an error object, and have callCC
return that object. See ?callCC
for more information.
You can actually use parfun
without a parallel backend registered and verify that the foreach
loop "breaks" as soon as it executes a task that throws an error, but that could take awhile since the tasks are executed sequentially. For example, this takes 20 seconds to execute if no backend is registered:
print(system.time(parfun(3, 4)))
When executing parfun
in parallel, we need to do more than simply break out of the foreach
loop: we also need to stop the workers, otherwise they will continue to compute their assigned tasks. With doMPI
, the workers can be stopped using mpi.abort
:
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
r <- parfun(getDoParWorkers(), getDoParWorkers())
if (inherits(r, 'error')) {
cat(sprintf('Caught error: %s\n', conditionMessage(r)))
mpi.abort(cl$comm)
}
Note that the cluster object can't be used after the loop aborts, because things weren't properly cleaned up, which is why the normal "stop" error handling doesn't work this way.
It's not a direct answer to your question, but using when()
you can avoid entering the loop if a condition is satisfied:
x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %:%
when( !is.element(k, 2:6) ) %do%
{
cat("Element ", k, "\n")
Sys.sleep(0.5)
k
}
EDIT:
I forgot something: I think it's by design, that you cannot just stop the foreach loop. If you run the loop in parallel, each turn is processed independently, which means when you stop the entire loop for k=2
it is not predictable if the process for k=1
terminated already or is still running. Hence, using the when()
condition gives you a deterministic result.
EDIT 2: Another solution considering your comment.
shouldStop <- FALSE
x <- foreach(k = 1:10, .combine="cbind", .errorhandling="stop") %do%
{
if( !shouldStop ){
# put your time consuming code here
cat("Element ", k, "\n")
Sys.sleep(0.5)
shouldStop <- shouldStop || is.element(k, 2:6)
k
}
}
Using this solution, the processes which are running while the stop condition becomes true are still calculated to an end, but you avoid time consumption on all upcoming processes.