Question
I am running Python scripts on a computing cluster (Slurm) in two sequential stages. I wrote two Python scripts, one for Stage 1 and another for Stage 2. Every morning I visually check whether all Stage 1 jobs have completed; only then do I start Stage 2.
Is there a more elegant/automated way to combine both stages and the job management in a single Python script? How can I tell whether a job has completed?
The workflow is similar to the following:
while not job_list.all_complete():
    for job in job_list:
        if job.empty():
            job.submit_stage1()
        if job.complete_stage1():
            job.submit_stage2()
    sleep(60)
Answer 1:
You have several courses of action:
- use the Slurm Python API to manage the jobs
- use job dependencies (search for --dependency in the sbatch man page)
- have the submission script for stage 1 submit the job for stage 2 when it finishes
- use a workflow management system such as:
  - Fireworks: https://materialsproject.github.io/fireworks/
  - Bosco: https://osg-bosco.github.io/docs/
  - Slurm pipelines: https://github.com/acorg/slurm-pipeline
  - Luigi: https://github.com/spotify/luigi
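For the job-dependency approach, a single driver script can submit both stages at once and let Slurm handle the ordering. A minimal sketch, assuming your stage scripts are wrapped in sbatch batch files (the names stage1.sh and stage2.sh are hypothetical):

```python
import subprocess

def sbatch_cmd(script, dependency=None):
    """Build an sbatch command line; afterok delays the job until the
    dependency job finishes with exit code 0."""
    cmd = ['sbatch', '--parsable']  # --parsable: print only the job ID
    if dependency is not None:
        cmd.append(f'--dependency=afterok:{dependency}')
    cmd.append(script)
    return cmd

def submit(script, dependency=None):
    """Submit a batch script and return its Slurm job ID as a string."""
    out = subprocess.run(sbatch_cmd(script, dependency),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().split(';')[0]

# Usage (on a cluster with sbatch available):
# stage1_id = submit('stage1.sh')            # hypothetical script name
# submit('stage2.sh', dependency=stage1_id)  # runs only if stage 1 succeeds
```

With afterok, stage 2 never starts if stage 1 fails; afterany would start it regardless of stage 1's exit status.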
Answer 2:
You haven't given much to go on for determining whether a job is finished, but a common way to solve this problem is to have each job create a sentinel file that you can look for, something like COMPLETE.
To do this you just add something like
# At the end of stage 1, create the sentinel file for this job
job_num = 1234
open(f'/shared/file/system/or/server/JOB_{job_num}/COMPLETE', 'x').close()
And then you just poll every once in a while to see whether you have a COMPLETE file for all of the jobs before starting stage 2.
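The polling step can be sketched as follows; the root path and the poll interval are assumptions, matching the sentinel layout above:

```python
import time
from pathlib import Path

def all_complete(job_nums, root='/shared/file/system/or/server'):
    """True once every job directory contains a COMPLETE sentinel file."""
    return all(Path(root, f'JOB_{n}', 'COMPLETE').exists() for n in job_nums)

def wait_for_stage1(job_nums, root='/shared/file/system/or/server',
                    poll_seconds=60):
    # Block until every Stage 1 job has written its sentinel file.
    while not all_complete(job_nums, root):
        time.sleep(poll_seconds)
```

Once wait_for_stage1 returns, the same script can go on to submit the Stage 2 jobs.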
Source: https://stackoverflow.com/questions/55404236/python-cluster-jobs-management