Dask with HTCondor scheduler

Submitted by 落花浮王杯 on 2020-06-01 09:20:48

Question


Background

I have an image analysis pipeline with parallelised steps. The pipeline is written in Python and the parallelisation is controlled by dask.distributed. The minimum processing setup is 1 scheduler + 3 workers with 15 processes each. In the first, short step of the analysis I use 1 process per worker but all the RAM of the node; in all the other analysis steps all nodes and processes are used.
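
For context, a minimal sketch of how such a layout can be launched with the dask.distributed CLI (the --nprocs/--nthreads flags are from the CLI of that era, and the paths are illustrative):

# One scheduler, writing its address to a shared file
dask-scheduler --scheduler-file /path/to/scheduler.json

# On each of the 3 nodes: 15 single-threaded worker processes
dask-worker --scheduler-file /path/to/scheduler.json --nprocs 15 --nthreads 1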

Issue

The admin will install HTCondor as a scheduler for the cluster.

Thought

In order to have my code running on the new setup, I was planning to use the approach shown in the Dask manual for SGE, because the cluster has a shared network file system.

# Job 1
# Start a dask-scheduler somewhere and write connection information to a file
qsub -b y /path/to/dask-scheduler --scheduler-file /path/to/scheduler.json

# Job 2
# Start 100 dask-worker processes in an array job pointing to the same file
qsub -b y -t 1-100 /path/to/dask-worker --scheduler-file /path/to/scheduler.json

# Job 3
# Start a process with the Python code, where the client is created this way
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
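
For reference, the first two jobs could be expressed as plain HTCondor submit files. This is a hypothetical translation, not a tested recipe: file names and paths are illustrative, and it assumes the shared file system makes scheduler.json visible on every node.

# scheduler.sub -- Job 1, one dask-scheduler
executable = /path/to/dask-scheduler
arguments  = --scheduler-file /path/to/scheduler.json
output     = scheduler.out
error      = scheduler.err
log        = scheduler.log
queue

# workers.sub -- Job 2; "queue 100" plays the role of the SGE array job
executable = /path/to/dask-worker
arguments  = --scheduler-file /path/to/scheduler.json
output     = worker_$(Process).out
error      = worker_$(Process).err
log        = workers.log
queue 100

Running condor_submit scheduler.sub first and condor_submit workers.sub afterwards mirrors the SGE ordering above.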

Question and advice

If I understood correctly, with this approach I will start the scheduler, the workers and the analysis as independent jobs (different HTCondor submit files). How can I make sure that the order of execution will be correct? Is there a way I can keep the same processing approach I have been using before, or would it be more efficient to translate the code to work better with HTCondor? Thanks for the help!


Answer 1:


HTCondor support has been merged (https://github.com/dask/dask-jobqueue/pull/245) and should now be available in dask-jobqueue as HTCondorCluster(cores=1, memory='100MB', disk='100MB').
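
A minimal sketch of that API (the resource figures are illustrative and should match your real per-process needs):

from dask.distributed import Client
from dask_jobqueue import HTCondorCluster

# Each HTCondor job runs one worker; cores/memory/disk are per job
cluster = HTCondorCluster(cores=1, memory='100MB', disk='100MB')
cluster.scale(jobs=45)   # e.g. 3 nodes x 15 processes
client = Client(cluster)

With this, dask-jobqueue submits and manages the worker jobs for you, so the scheduler/worker/analysis ordering problem from the question goes away.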



Source: https://stackoverflow.com/questions/53488603/dask-with-htcondor-scheduler
