To reduce provisioning time, we've decided to keep a dedicated EMR cluster running with 5 instances (we expect to need about 5). In case we need more, we think we'
As of late 2016, AWS does not support autoscaling out of the box as part of EMR. However, the EMR API provides all the necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.
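As a rough illustration of those two ingredients, here is a minimal sketch using boto3. It assumes the cluster publishes the `YARNMemoryAvailablePercentage` CloudWatch metric and has a TASK instance group; the function names, the 15-minute window, and the choice of metric are illustrative assumptions, not something prescribed by AWS or by this answer.

```python
from datetime import datetime, timedelta, timezone


def average_available_memory_pct(cloudwatch, cluster_id, minutes=15):
    """Return the mean YARNMemoryAvailablePercentage over the last
    `minutes`, or None if CloudWatch returned no datapoints."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElasticMapReduce",
        MetricName="YARNMemoryAvailablePercentage",  # assumed metric choice
        Dimensions=[{"Name": "JobFlowId", "Value": cluster_id}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,  # five-minute buckets
        Statistics=["Average"],
    )
    points = resp.get("Datapoints", [])
    if not points:
        return None
    return sum(p["Average"] for p in points) / len(points)


def resize_task_group(emr, cluster_id, new_count):
    """Request `new_count` instances for the first TASK instance group
    and return that group's id."""
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    task_groups = [g for g in groups if g["InstanceGroupType"] == "TASK"]
    if not task_groups:
        raise ValueError("cluster has no TASK instance group to resize")
    group_id = task_groups[0]["Id"]
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[{"InstanceGroupId": group_id, "InstanceCount": new_count}],
    )
    return group_id
```

With real credentials you would pass in `boto3.client("cloudwatch")` and `boto3.client("emr")`. The sketch resizes only task nodes because they hold no HDFS data, which makes removing them less risky than shrinking the core group.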
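There are two main options for implementing autoscaling for EMR clusters:
1. An autoscaling loop: a process running on your own server that periodically collects monitoring data and resizes the cluster.
2. An event-based approach: scaling is triggered by CloudWatch metrics, with no server of your own to run.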
Both options have their pros and cons. The advantage of option 2 is that it is a serverless approach (you do not need to run your own server). The downside is that CloudWatch metrics are collected in batches (typically at five-minute intervals), so the data may be slightly delayed or less precise. Also, the event-based approach may not provide the tools you need to inspect the current and historical state of your cluster scaling. Option 1, on the other hand, does require a server, but in return gives you more control over the logic of your scaling rules. It also lets you keep searchable records of the history of scaling decisions.
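To make option 1 more concrete, here is a minimal sketch of such a loop with a recorded decision history. Everything in it is a hypothetical illustration: the thresholds, the one-node step size, the node limits, and the `read_metric`, `get_count`, and `set_count` callables (which you would back with your own monitoring and resize code) are assumptions, not part of any AWS API.

```python
import time
from dataclasses import dataclass


@dataclass
class ScalingDecision:
    """One record in the history of scaling decisions."""
    timestamp: float
    available_memory_pct: float
    old_count: int
    new_count: int


def decide_node_count(available_memory_pct, current, min_nodes=5, max_nodes=20,
                      low_water=15.0, high_water=75.0):
    """Add a node when free memory is scarce, remove one when it is
    plentiful, and clamp the result to [min_nodes, max_nodes]."""
    if available_memory_pct < low_water:
        target = current + 1
    elif available_memory_pct > high_water:
        target = current - 1
    else:
        target = current
    return max(min_nodes, min(max_nodes, target))


def run_loop(read_metric, get_count, set_count, history, iterations,
             interval_s=60.0, sleep=time.sleep):
    """Run the monitor/decide/act cycle `iterations` times.

    read_metric() returns the available-memory percentage or None,
    get_count() returns the current node count, and set_count(n)
    requests a resize. All three are supplied by the caller.
    """
    for _ in range(iterations):
        pct = read_metric()
        if pct is not None:  # skip the cycle when no data is available
            current = get_count()
            target = decide_node_count(pct, current)
            if target != current:
                set_count(target)
            history.append(ScalingDecision(time.time(), pct, current, target))
        sleep(interval_s)
```

Because every decision is appended to `history`, you could write the records to a database or log index to get the searchable scaling history mentioned above. Injecting `sleep` keeps the loop easy to test without waiting.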
You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop discussed in option 1 above. Its current features include proactive as well as reactive autoscaling and a Web UI, and the tool is very easy to configure.