Question
I have tried the following combinations of bootstrap actions to increase the heap size of my job, but none of them seem to work:
--mapred-key-value mapred.child.java.opts=-Xmx1024m
--mapred-key-value mapred.child.ulimit=unlimited
--mapred-key-value mapred.map.child.java.opts=-Xmx1024m
--mapred-key-value mapred.map.child.ulimit=unlimited
-m mapred.map.child.java.opts=-Xmx1024m
-m mapred.map.child.ulimit=unlimited
-m mapred.child.java.opts=-Xmx1024m
-m mapred.child.ulimit=unlimited
What is the right syntax?
Answer 1:
You have two options to achieve this:
Custom JVM Settings
To apply custom settings, you might want to have a look at the Bootstrap Actions documentation for Amazon Elastic MapReduce (Amazon EMR), specifically the Configure Daemons action:
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.
An example is provided as well, which sets the namenode heap size to 2048 MB and passes a garbage-collection option to the namenode JVM:
$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
Predefined JVM Settings
Alternatively, as per the FAQ How do I configure Hadoop settings for my job flow?, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. A predefined bootstrap action is available to configure your job flow on startup for this situation: Configure Memory-Intensive Workloads, which sets cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads, for example:
$ ./elastic-mapreduce --create \
--bootstrap-action \
s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive
The specific configuration settings applied by this predefined bootstrap action are listed in Hadoop Memory-Intensive Configuration Settings.
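Note that bootstrap actions can be combined, so one option, as a rough, untested sketch (assuming a later configure-hadoop action can still override individual keys applied by the preset), is to apply the memory-intensive preset and then adjust a single mapred setting:
$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configurations/latest/memory-intensive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.child.java.opts=-Xmx1024m"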
Good luck!
Answer 2:
Steffen's answer is good and works. On the other hand, if you just want something quick and dirty to change one or two variables, you can pass them on the command line like the following:
elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "-m,mapred.child.java.opts=-Xmx999m"
I've seen other, older documentation that simply wraps the entire expression in a single pair of quotes, like the following:
--bootstrap-action "s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m \
mapred.child.java.opts=-Xmx999m" ### I tried this style, it no longer works!
At any rate, this is not easily found in the AWS EMR documentation. I suspect that mapred.child.java.opts is one of the most commonly overridden variables; I was also looking for an answer when I hit the GC error "java.lang.OutOfMemoryError: GC overhead limit exceeded" and stumbled on this page. The default of 200m is just too small (see the documentation on defaults).
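To confirm that the override actually took effect, one approach (assuming the usual EMR AMI layout with the Hadoop config under /home/hadoop/conf, which may differ on your AMI version) is to SSH to the master node and inspect the generated mapred-site.xml:
$ grep -A 1 mapred.child.java.opts /home/hadoop/conf/mapred-site.xml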
Good luck!
Source: https://stackoverflow.com/questions/10024476/amazon-elastic-mapreduce-bootstrap-actions-not-working