I would like to know how to specify MapReduce configuration properties such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using custo…
In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:
Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]
Section Running Custom Bootstrap Actions from the CLI provides a generic usage example:
$ ./elastic-mapreduce --create --stream --alive \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --output s3n://myawsbucket \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh
In particular, there are separate bootstrap actions to configure Hadoop and Java:
You can specify Hadoop settings via bootstrap action Configure Hadoop, which allows you to set cluster-wide Hadoop settings, for example:
$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"
You can specify custom JVM settings via bootstrap action Configure Daemons:
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.
The provided example sets the NameNode heap size to 2048 and configures a NameNode JVM option (the garbage-collection time ratio):
$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
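Putting this together for the settings asked about, the key-value form of the Configure Hadoop bootstrap action could be combined with a streaming job. This is only a sketch, assuming the -s flag can be repeated and reusing the sample wordcount input/mapper with an illustrative output bucket:

$ ./elastic-mapreduce --create --stream \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,mapred.task.timeout=0,-s,mapred.min.split.size=134217728" \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --output s3n://myawsbucket

Note that the exact key-value flags accepted by Configure Hadoop depend on the Hadoop version of the AMI, so check the Configure Hadoop documentation for your cluster before relying on -s.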