AWS EMR Parallel Mappers?

后端 未结 1 1149
情书的邮戳
情书的邮戳 2021-01-25 09:48

I am trying to determine how many nodes I need for my EMR cluster. As part of best practices the recommendations are:

(Total Mappers needed for your job + Time taken to

相关标签:
1条回答
  • 2021-01-25 10:07

    To determine the number of parallel mappers , you will need to check this documentation from EMR called Task Configuration where EMR had a predefined mapping set of configurations for every instance type which would determine the number of mappers/reducers.
    http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html

    For example : Lets say you have 5 m1.xlarge core nodes. According to the default mapred-site.xml configuration values for that instance type from EMR docs, we have

    mapreduce.map.memory.mb = 768
    yarn.nodemanager.resource.memory-mb = 12288
    yarn.scheduler.maximum-allocation-mb = 12288 (same as above)
    

    You can simply divide the later with former setting to get the maximum number of mappers supported by one m1.xlarge node = (12288/768) = 16

    So, for the 5 node cluster , it would a max of 16*5 = 80 mappers that can run in parallel (considering a map only job). The same is the case with max parallel Reducers(30). You can do similar math for a combination of mappers and reducers.

    So, If you want to run more mappers in parallel , you can either re-size the cluster or reduce the mapreduce.map.memory.mb(and its heap mapreduce.map.java.opts) on every node and restart NM to

    To understand what the above mapred-site.xml properties mean and why you do need to do those calculations , you can refer it here : https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

    Note : The above calculations and statements are true if EMR stays in its default configuration using YARN capacity scheduler with DefaultResourceCalculator. If for example , you configure your capacity scheduler to use DominantResourceCalculator, it will consider VCPU's + Memory on every nodes (not just memory's) to decide on parallel number of mappers.

    0 讨论(0)
提交回复
热议问题