Question
When qsub-ing jobs on a StarCluster / SGE cluster, is there an easy way to ensure that each node receives at most one job at a time? I am having issues where multiple jobs end up on the same node, leading to out-of-memory (OOM) issues.
I tried using -l cpu=8, but I think that checks only the total number of cores on the box itself, not the number of cores actually in use.
I also tried -l slots=8, but then I get:
Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly.
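(For reference, that error points at the parallel-environment route: rather than requesting slots directly, you define a PE and request it with -pe. A rough sketch of that approach, where the PE name onepernode is made up and 8-slot nodes plus qconf admin access are assumed:

qconf -ap onepernode                          # create the PE; set slots and allocation_rule to $pe_slots
qconf -aattr queue pe_list onepernode all.q   # attach the PE to the queue
qsub -pe onepernode 8 job.sh                  # one job then claims all 8 slots of a single node

With allocation_rule $pe_slots, all requested slots must come from one host, so requesting a full node's worth of slots keeps other jobs off it.)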
Answer 1:
In your config file (.starcluster/config), add this section:
[plugin sge]
setup_class = starcluster.plugins.sge.SGEPlugin
slots_per_host = 1
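For the plugin to take effect it also needs to be listed in your cluster template before the cluster is (re)started; a minimal sketch, assuming your template is named smallcluster:

[cluster smallcluster]
# ...existing template settings...
plugins = sge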
Answer 2:
This largely depends on how the cluster resources are configured, i.e. memory limits, etc. However, one thing to try is to request a lot of memory for each job:
-l h_vmem=xxG
This has the side effect of excluding other jobs from running on the node, since most of the memory on that node has already been requested by a previously running job.
Just make sure the memory you request is not above the allowable limit for the node. You can see whether it exceeds this limit by checking the output of qstat -j <jobid> for errors.
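As a concrete but hypothetical example, on nodes with roughly 30 GB of RAM you might submit with something like the following; the 28G figure is illustrative, so pick a value just under your node's actual limit:

qsub -l h_vmem=28G job.sh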
Answer 3:
I accomplished this by setting the number of slots on each of my nodes to 1 using:
qconf -aattr queue slots "[nodeXXX=1]" all.q
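To apply the same change to every execution host rather than one nodeXXX at a time, a small loop along these lines should work (a sketch, assuming it is run on the master node with qconf available):

for host in $(qconf -sel); do
    qconf -aattr queue slots "[$host=1]" all.q
done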
Source: https://stackoverflow.com/questions/25672896/ensuring-one-job-per-node-on-starcluster-sungridengine-sge