I would like to know how to specify MapReduce configuration properties such as mapred.task.timeout, mapred.min.split.size, etc., when running a streaming job using custo…
In the context of Amazon Elastic MapReduce (Amazon EMR), you are looking for Bootstrap Actions:
Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. [emphasis mine]
Section Running Custom Bootstrap Actions from the CLI provides a generic usage example:
$ ./elastic-mapreduce --create --stream --alive \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --output s3n://myawsbucket \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/download.sh
In particular, there are separate bootstrap actions to configure Hadoop and Java:
You can specify Hadoop settings via bootstrap action Configure Hadoop, which allows you to set cluster-wide Hadoop settings, for example:
$ ./elastic-mapreduce --create \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
--args "--site-config-file,s3://myawsbucket/config.xml,-s,mapred.task.timeout=0"
You can specify custom JVM settings via bootstrap action Configure Daemons:
This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use this bootstrap action to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use this bootstrap action to modify advanced JVM options, such as garbage collection behavior.
The provided example sets the NameNode heap size to 2048 and configures a NameNode JVM option (the garbage-collection time ratio):
$ ./elastic-mapreduce --create --alive \
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-daemons \
--args --namenode-heap-size=2048,--namenode-opts=-XX:GCTimeRatio=19
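Putting this together for the settings asked about, the key-value form of the Configure Hadoop bootstrap action could be combined with a streaming job. This is only a sketch, assuming the -s flag can be repeated and reusing the sample wordcount input/mapper with an illustrative output bucket:

$ ./elastic-mapreduce --create --stream \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-s,mapred.task.timeout=0,-s,mapred.min.split.size=134217728" \
  --input s3n://elasticmapreduce/samples/wordcount/input \
  --mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --output s3n://myawsbucket

Note that the exact key-value flags accepted by Configure Hadoop depend on the Hadoop version of the AMI, so check the Configure Hadoop documentation for your cluster before relying on -s.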