Dataproc set number of vcores per executor container

Submitted by 瘦欲 on 2019-12-20 04:24:26

Question


I'm building a spark application which will run on Dataproc. I plan to use ephemeral clusters, and spin a new one up for each execution of the application. So I basically want my job to eat up as much of the cluster resources as possible, and I have a very good idea of the requirements.

I've been playing around with turning off dynamic allocation and setting up the executor instances and cores myself. Currently I'm using 6 instances and 30 cores a pop.
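For reference, this is roughly the setup being described; a minimal sketch in Scala, assuming the settings are applied programmatically through SparkSession (they could just as well be passed as --conf flags or Dataproc job properties):

```scala
import org.apache.spark.sql.SparkSession

// Static allocation as described above: dynamic allocation off,
// a fixed number of executors, and 30 cores per executor.
val spark = SparkSession.builder()
  .appName("dataproc-static-allocation")              // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "6")
  .config("spark.executor.cores", "30")
  .getOrCreate()
```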

Perhaps it's more of a YARN question, but I'm finding the relationship between container vCores and my Spark executor cores a bit confusing. In the YARN application manager UI I see that 7 containers are spawned (1 driver and 6 executors) and each of these uses 1 vCore. Within Spark, however, I see that the executors themselves are using the 30 cores I specified.

So I'm curious whether the executors are trying to do 30 tasks in parallel on what is essentially a 1-core box. Or maybe the vCore count displayed in the AM GUI is erroneous?

If it's the former, I'm wondering what the best way is to set this application up so that I end up with one executor per worker node and all the CPUs are used.


Answer 1:


The vCore count displayed in the YARN GUI is indeed erroneous; this is a not-well-documented but known issue with the capacity-scheduler, which is Dataproc's default. Notably, with the default settings on Dataproc, YARN only does resource bin-packing based on memory rather than CPUs. The benefit is that this is more versatile for oversubscribing CPUs to varying degrees as desired per workload, especially if something is IO-bound; the downside is that YARN won't carve out CPU usage in a fixed manner.

See https://stackoverflow.com/a/43302303/3777211 for some discussion of changing to the fair-scheduler so that the vcore allocation is accurately represented in YARN. However, in your case there's probably no benefit to doing so; making YARN do bin-packing across both dimensions is more of a "shared multitenant cluster" concern, and it only complicates the scheduling problem.

In your case, the best way to set your application up is just to ignore what YARN says about vcores; if you want just one executor per worker node, then set the executor memory size to the maximum that will fit in YARN per node, and make cores per executor equal to the total number of cores per node.
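As a rough sketch of that recommendation (the memory figures below are illustrative placeholders, not Dataproc defaults; substitute the per-node YARN memory limit and core count for your actual machine type):

```scala
import org.apache.spark.sql.SparkSession

// One executor per worker node: executor cores equal to the node's core count,
// and executor memory (plus overhead) sized just under the node's YARN limit
// (yarn.nodemanager.resource.memory-mb). All values here are placeholders.
val spark = SparkSession.builder()
  .appName("one-executor-per-node")                    // hypothetical app name
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "6")             // one per worker node
  .config("spark.executor.cores", "30")                // all usable cores on the node
  .config("spark.executor.memory", "100g")             // placeholder: max that fits under the node's YARN limit
  .config("spark.executor.memoryOverhead", "10g")      // placeholder off-heap overhead
  .getOrCreate()
```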



Source: https://stackoverflow.com/questions/52450536/dataproc-set-number-of-vcores-per-executor-container
