The vcores displayed in the YARN GUI are erroneous; this is a known but not-well-documented quirk of the capacity-scheduler, which is Dataproc's default. Notably, with the default settings on Dataproc, YARN only does resource bin-packing based on memory rather than CPUs. The benefit is that this is more versatile for oversubscribing CPUs to varying degrees as desired per-workload, especially if something is IO-bound; the downside is that YARN won't carve out CPU usage in a fixed manner.
See https://stackoverflow.com/a/43302303/3777211 for some discussion of switching to the fair-scheduler if you want the vcores allocation to be accurately represented in YARN. In your case, however, there's probably no benefit to doing so; making YARN bin-pack across both dimensions is more of a "shared multitenant cluster" concern, and only complicates the scheduling problem.
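For reference, the behavior described above comes down to the capacity-scheduler's resource calculator. If you did want YARN to account for CPU as well as memory, a minimal sketch of the relevant `capacity-scheduler.xml` override would look like this (the default is `DefaultResourceCalculator`, which considers memory only):

```xml
<!-- capacity-scheduler.xml: make the capacity-scheduler account for vcores
     in addition to memory. The default, DefaultResourceCalculator, bin-packs
     on memory only, which is why the vcore counts in the YARN GUI look wrong. -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

On Dataproc you could apply this at cluster-creation time via `--properties 'capacity-scheduler:yarn.scheduler.capacity.resource-calculator=...'`, though as noted, it buys you little on a single-tenant ephemeral cluster.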
In your case, the best way to set up your application is just to ignore what YARN says about vcores. If you want just one executor per worker node, set the executor memory size to the maximum that will fit in YARN per node, and set cores per executor equal to the total number of cores per node.
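To make that concrete, here's a minimal PySpark sketch of the sizing arithmetic. The node shape and the YARN NodeManager memory value are assumptions for illustration; substitute whatever `yarn.nodemanager.resource.memory-mb` your cluster actually reports:

```python
from pyspark.sql import SparkSession

# Assumed worker shape for illustration: n1-standard-4 (4 vCPUs, 15 GB RAM).
# Dataproc sets yarn.nodemanager.resource.memory-mb to a machine-dependent
# value; check your cluster's config rather than trusting this number.
YARN_NM_MEMORY_MB = 12288   # memory YARN can hand out per node (assumption)
CORES_PER_NODE = 4          # total cores on each worker (assumption)

# The executor's JVM heap plus its off-heap overhead (10% by default, with a
# 384 MB floor) must fit inside one YARN container, so back the overhead out
# of the per-node budget to get the largest executor that still fits.
executor_memory_mb = int(YARN_NM_MEMORY_MB / 1.10)

spark = (
    SparkSession.builder
    .appName("one-executor-per-node")
    .config("spark.executor.memory", f"{executor_memory_mb}m")
    .config("spark.executor.cores", str(CORES_PER_NODE))
    .getOrCreate()
)
```

Passing the same properties at submit time works just as well; the point is that `spark.executor.cores` here only drives Spark's own task-slot bookkeeping, since YARN isn't enforcing CPU anyway.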