I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation
flag as documented here. According to those docs, "
In EMR 3.x, maximizeResourceAllocation was implemented with a reference table: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/vcorereference.tsv
The table is used by a shell script, maximize-spark-default-config, in the same repo, so you can take a look at how they implemented it.
Maybe in the new EMR 4 this reference table is somehow wrong. I believe you can find all of these AWS scripts on the EC2 instances of your EMR cluster; they should be located in /usr/lib/spark, /opt/aws, or somewhere similar.
In any case, you can at least write your own bootstrap action script for this in EMR 4, with a corrected reference table, similar to the implementation in EMR 3.x.
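As a rough illustration of the reference-table idea, here is a minimal Python sketch of what such a bootstrap script could compute. Everything specific in it is an assumption: the column names and the two rows in REFERENCE_TSV are made-up placeholders (not the contents of AWS's vcorereference.tsv), the helper names are hypothetical, and the sizing rule (one executor per worker node, all cores, 10% of memory left as overhead) is just one simple choice, not necessarily what maximizeResourceAllocation does.

```python
import csv
import io

# Hypothetical reference table. Column names and values are illustrative
# placeholders, NOT copied from AWS's vcorereference.tsv.
REFERENCE_TSV = """\
instance_type\tvcores\tmemory_mb
m3.xlarge\t4\t11520
c3.4xlarge\t16\t23040
"""


def load_reference(text):
    """Parse the TSV text into {instance_type: (vcores, memory_mb)}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        row["instance_type"]: (int(row["vcores"]), int(row["memory_mb"]))
        for row in reader
    }


def spark_defaults(instance_type, worker_count, table, overhead_percent=10):
    """Derive Spark settings for a cluster of identical worker nodes.

    Assumed rule: one executor per worker node that uses every core,
    with overhead_percent of the node memory left unallocated.
    """
    vcores, memory_mb = table[instance_type]
    executor_memory = memory_mb - memory_mb * overhead_percent // 100
    return {
        "spark.executor.instances": str(worker_count),
        "spark.executor.cores": str(vcores),
        "spark.executor.memory": f"{executor_memory}m",
        "spark.default.parallelism": str(2 * vcores * worker_count),
    }


def render(conf):
    """Format the settings as spark-defaults.conf lines (key, space, value)."""
    return "\n".join(f"{key} {value}" for key, value in sorted(conf.items()))


if __name__ == "__main__":
    table = load_reference(REFERENCE_TSV)
    print(render(spark_defaults("m3.xlarge", worker_count=2, table=table)))
```

A bootstrap action could append the rendered lines to the cluster's spark-defaults.conf; where that file lives on an EMR 4 node is something you would need to check on the instance itself.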
Moreover, since we are going to use the STUPS infrastructure, it is worth taking a look at the STUPS appliance for Spark: https://github.com/zalando/spark-appliance
You can explicitly specify the number of cores by setting the senza parameter DefaultCores when you deploy your Spark cluster.
Some highlights of this appliance compared to EMR:
it can be used even with t2 instance types, it is auto-scalable based on roles like other STUPS appliances, etc.
You can also easily deploy your cluster in HA mode with ZooKeeper, so there is no SPOF on the master node. HA mode is currently still not possible in EMR, and I believe EMR is mainly designed for "large clusters temporarily for ad-hoc analysis jobs", not for "dedicated cluster that is permanently on", so I do not expect HA mode to become possible in EMR in the near future.