Question
I used the following logic to restart the uncompleted jobs in a single-node Spring Batch application:
public void restartUncompletedJobs() {
    try {
        jobRegistry.register(new ReferenceJobFactory(documetPipelineJob));
        List<String> jobs = jobExplorer.getJobNames();
        for (String job : jobs) {
            Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(job);
            for (JobExecution runningJob : runningJobs) {
                runningJob.setStatus(BatchStatus.FAILED);
                runningJob.setEndTime(new Date());
                jobRepository.update(runningJob);
                jobOperator.restart(runningJob.getId());
            }
        }
    } catch (Exception e) {
        LOGGER.error(e.getMessage(), e);
    }
}
Right now I'm trying to make it work on a two-node cluster. The applications on both nodes will point to a shared PostgreSQL database.
Let's consider the following example: I have two job instances - jobInstance1 is currently running on node1 and jobInstance2 is running on node2. Node1 is restarted for some reason during the execution of jobInstance1. After the restart of node1, the Spring Batch application tries to restart the uncompleted jobs with the logic shown above: it sees two uncompleted job instances - jobInstance1 and jobInstance2 (which is still correctly running on node2) - and tries to restart both of them. So instead of restarting only jobInstance1, it restarts both jobInstance1 and jobInstance2, even though jobInstance2 should not be restarted because it is executing correctly on node2 at that moment.
How can I correctly restart, at application startup, the jobs that were left uncompleted before the previous application termination, while preventing jobs like jobInstance2 from being restarted as well?
UPDATED
This is the solution provided in the answer below:
1. Get the job instances of your job with JobOperator#getJobInstances.
2. For each instance, check if there is a running execution using JobOperator#getExecutions.
   2.1 If there is a running execution, move to the next instance (in order to let the execution finish either successfully or with a failure).
   2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
I have a question regarding step 2.1 - after an application restart, will Spring Batch automatically restart uncompleted jobs that still have a running execution, or do I need to take manual action to do so?
Answer 1:
Your logic is not restarting uncompleted jobs. It is taking currently running job executions, setting their status to FAILED and restarting them. Your logic should not look for running executions; it should look for executions that are not currently running, especially failed ones, and restart those.
How to correctly restart the failed jobs and prevent jobs like jobInstance2 from being restarted as well?
In pseudo code, what you need to do to achieve this is:
1. Get the job instances of your job with JobOperator#getJobInstances.
2. For each instance, check if there is a running execution using JobOperator#getExecutions.
   2.1 If there is a running execution, move to the next instance (in order to let the execution finish either successfully or with a failure).
   2.2 If there is no currently running execution, check the status of the last execution and restart it if failed using JobOperator#restart.
In your scenario:
- jobInstance1 should be restarted in step 2.2
- jobInstance2 should be filtered out in step 2.1, since there is a running execution for it on node 2.
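For reference, here is a minimal sketch of that pseudo-code in Java. It uses JobExplorer to inspect instances and executions (instead of JobOperator#getJobInstances/#getExecutions, which only return ids) together with JobOperator#restart; the class name, constructor wiring and the limit of 100 instances are assumptions for illustration, not part of the original answer.

import java.util.Comparator;
import java.util.List;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobOperator;

public class FailedJobRestarter {

    private static final Logger LOGGER = LoggerFactory.getLogger(FailedJobRestarter.class);

    private final JobExplorer jobExplorer;
    private final JobOperator jobOperator;

    public FailedJobRestarter(JobExplorer jobExplorer, JobOperator jobOperator) {
        this.jobExplorer = jobExplorer;
        this.jobOperator = jobOperator;
    }

    public void restartFailedInstances(String jobName) {
        try {
            // 1. Get the job instances of the given job (here: the 100 most recent ones)
            List<JobInstance> instances = jobExplorer.getJobInstances(jobName, 0, 100);
            for (JobInstance instance : instances) {
                List<JobExecution> executions = jobExplorer.getJobExecutions(instance);
                if (executions.isEmpty()) {
                    continue;
                }
                // 2.1 If any execution of this instance is still running (possibly on
                //     the other node), skip the instance and let that execution finish
                boolean hasRunningExecution = executions.stream()
                        .anyMatch(JobExecution::isRunning);
                if (hasRunningExecution) {
                    continue;
                }
                // 2.2 Otherwise, restart the most recent execution if it failed
                JobExecution lastExecution = executions.stream()
                        .max(Comparator.comparing(JobExecution::getId))
                        .get();
                if (lastExecution.getStatus() == BatchStatus.FAILED) {
                    jobOperator.restart(lastExecution.getId());
                }
            }
        } catch (Exception e) {
            LOGGER.error("Could not restart failed executions of job " + jobName, e);
        }
    }
}

You would call restartFailedInstances(...) once at application startup (for example from a startup listener), passing the name of your job.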
来源:https://stackoverflow.com/questions/51568654/spring-batch-correctly-restart-uncompleted-jobs-in-clustered-environment