Spark + EMR using Amazon's “maximizeResourceAllocation” setting does not use all cores/vcores

Asked by 我寻月下人不归 · 2021-01-30 04:34

I'm running an EMR cluster (version emr-4.2.0) for Spark using the Amazon-specific maximizeResourceAllocation flag as documented here. According to those docs, this setting should configure the executors to use the maximum compute and memory resources available on each node, yet the cluster does not end up using all of the available cores/vcores.
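For reference, this is roughly how the flag can be enabled at cluster launch; a minimal boto3 sketch, where the cluster name, roles, and instance types are placeholders and the "spark" classification with maximizeResourceAllocation=true is the documented switch:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Placeholder name, roles, and instance types -- adjust for your account.
    response = emr.run_job_flow(
        Name="spark-maximize-example",
        ReleaseLabel="emr-4.2.0",
        Applications=[{"Name": "Spark"}],
        Configurations=[
            {
                # The "spark" classification with maximizeResourceAllocation=true
                # tells EMR to size executors to the node's resources.
                "Classification": "spark",
                "Properties": {"maximizeResourceAllocation": "true"},
            }
        ],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])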

3 Answers
  •  一个人的身影 · 2021-01-30 05:12

    In EMR 3.x, maximizeResourceAllocation was implemented with a reference table: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/vcorereference.tsv

    That table is used by a shell script, maximize-spark-default-config, in the same repository; you can look there to see how it is implemented.

    Perhaps in the new EMR version 4 this reference table is somehow wrong. I believe you can find the AWS scripts on the EC2 instances of your EMR cluster, probably under /usr/lib/spark or /opt/aws or somewhere similar.

    In any case, in EMR 4 you can write your own bootstrap action script for this, with a correct reference table, similar to the EMR 3.x implementation. Alternatively, you can set the executor sizing explicitly, as sketched below.
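    Instead of a cluster-wide bootstrap script, a simpler workaround is to pin the executor sizing per application. A minimal PySpark sketch; the numbers below are placeholders you would derive from your instance type's vcores and memory:

        from pyspark import SparkConf, SparkContext

        # Placeholder sizing for two core nodes with 4 vcores each;
        # adjust the values for your actual instance type and node count.
        conf = (
            SparkConf()
            .setAppName("explicit-resource-allocation")
            .set("spark.executor.instances", "2")   # one executor per core node
            .set("spark.executor.cores", "4")       # use every vcore on the node
            .set("spark.executor.memory", "9g")
            .set("spark.default.parallelism", "8")  # total executor cores
        )

        sc = SparkContext(conf=conf)

        # Quick check that the settings took effect and the job runs.
        print(sc.getConf().get("spark.executor.cores"))
        print(sc.parallelize(range(100), 8).map(lambda x: x * x).sum())
        sc.stop()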

    Moreover, since we are going to use the STUPS infrastructure, it is worth taking a look at the STUPS appliance for Spark: https://github.com/zalando/spark-appliance

    There you can explicitly specify the number of cores by setting the senza parameter DefaultCores when you deploy your Spark cluster.

    Some highlights of this appliance compared to EMR: it can be used even with t2 instance types, and it is auto-scalable based on roles, like the other STUPS appliances.

    You can also easily deploy your cluster in HA mode with ZooKeeper, so there is no single point of failure on the master node. HA mode is currently still not possible in EMR, and I believe EMR is mainly designed for "large clusters temporarily for ad-hoc analysis jobs" rather than a "dedicated cluster that is permanently on", so HA mode will probably not be possible with EMR in the near future.
