Autoscaling EMR: is it required? Should I just use EC2? Should I just use Qubole?

野的像风 2021-02-11 07:48

In order to reduce the time for provisioning, we've decided to keep up a dedicated EMR cluster with 5 instances (we expect to need about 5). In case we need more, we think we'

2 answers
  • 2021-02-11 08:07

    The page you linked showed ways of either manually or programmatically increasing the nodes in your cluster. I couldn't find anything else about autoscaling for EMR.

    Unless we're missing some facts, you'd still have to come up with your own scaling algorithm and process. If you take into account factors such as your job backlog, the units of time you're paying for, the use of less expensive "spot" instances, multiple clusters, etc., this is probably not a trivial exercise.

    Besides growing your cluster, there is also downsizing to consider. EMR allows this (manually or programmatically) for task nodes, but the documentation states that it is not supported for core nodes. You'd have to terminate the core node through AWS functionality and risk losing data. If your workloads increase and decrease over time, core node downsizing would be valuable for keeping your costs lower.
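To make the "programmatically" part concrete, here is a minimal sketch of resizing only the task nodes. It assumes the caller passes in a client that behaves like `boto3.client("emr")`; the helper name `resize_task_group` is hypothetical, and pagination of the group listing is ignored for brevity.

```python
def resize_task_group(emr_client, cluster_id, target_count):
    """Set the first TASK instance group of a cluster to target_count nodes.

    emr_client is assumed to behave like boto3.client("emr").
    Returns the id of the resized group, or None if the cluster has no TASK group.
    """
    if target_count < 0:
        raise ValueError("target_count must be non-negative")

    response = emr_client.list_instance_groups(ClusterId=cluster_id)
    for group in response["InstanceGroups"]:
        # Only task nodes are touched; core nodes hold data and are left alone.
        if group["InstanceGroupType"] == "TASK":
            emr_client.modify_instance_groups(
                ClusterId=cluster_id,
                InstanceGroups=[
                    {"InstanceGroupId": group["Id"], "InstanceCount": target_count}
                ],
            )
            return group["Id"]
    return None
```

Passing the client in, rather than creating it inside the function, keeps the sketch easy to try against a stub before pointing it at a real cluster.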

    Qubole automatically takes care of all of these things out of the box. You run your jobs from the UI or API and it starts, sizes or resizes the cluster. When you're finished, it downsizes or terminates the cluster. It also lets you keep a minimum number of nodes running at all times. I've also heard that the startup time for Qubole nodes is significantly shorter than for EMR nodes.

    Hope this helps you.

  • 2021-02-11 08:08

    As of late 2016, AWS does not support autoscaling out of the box as part of EMR. However, the EMR API provides all the necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.

    Basically, there are two main options to implement autoscaling for EMR clusters:

    1. Autoscaling Loop: A process that runs on a server and continuously monitors the cluster's current load. Performance metrics (memory, CPU, I/O, etc.) can be collected at regular intervals and stored in a database. Autoscaling rules are evaluated against the performance metrics, and the cluster's task nodes are scaled up or down if required.
    2. Event-Based Autoscaling: Using CloudWatch metrics (e.g., metrics for EMR, or metrics for EC2), you can programmatically define triggers that are fired under certain conditions (for instance, add nodes if average CPUUtilization of all nodes exceeds 80%).
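The autoscaling loop of option 1 can be sketched roughly as follows. This is an illustration only: the thresholds, the step size, and the three callables (`get_avg_cpu`, `get_task_nodes`, `set_task_nodes`) are assumptions that stand in for your own monitoring and resize code, not part of any EMR API.

```python
import time


def decide_task_nodes(current, avg_cpu, min_nodes=0, max_nodes=20,
                      upper=80.0, lower=30.0, step=2):
    """Return the desired number of task nodes for the observed average CPU."""
    if avg_cpu > upper:
        target = current + step   # overloaded: add nodes
    elif avg_cpu < lower:
        target = current - step   # mostly idle: remove nodes
    else:
        target = current          # within the comfortable band: no change
    # Clamp the result to the allowed range.
    return max(min_nodes, min(max_nodes, target))


def autoscaling_loop(get_avg_cpu, get_task_nodes, set_task_nodes,
                     interval_seconds=60, iterations=None):
    """Poll metrics, apply the rule, and record every decision.

    iterations=None runs forever; a number limits the loop (handy for dry runs).
    """
    history = []
    done = 0
    while iterations is None or done < iterations:
        current = get_task_nodes()
        cpu = get_avg_cpu()
        target = decide_task_nodes(current, cpu)
        if target != current:
            set_task_nodes(target)
        history.append({"time": time.time(), "cpu": cpu,
                        "from": current, "to": target})
        done += 1
        if iterations is None or done < iterations:
            time.sleep(interval_seconds)
    return history
```

Keeping the rule (`decide_task_nodes`) separate from the polling loop makes it possible to check the scaling logic without touching a cluster, and the returned history is the kind of record you could store in a database.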

    Both options have their pros and cons. The advantage of option 2 is that it is a serverless approach (it does not require running your own server). The downside is that CloudWatch metrics are collected in batches (typically at five-minute intervals), so the data may be slightly delayed or less precise. Also, the event-based approach may not provide the tools required to inspect the current and historical state of your cluster scaling. Option 1, on the other hand, does require a server, but in return gives you more control over the logic of your scaling rules. It also lets you keep searchable records of the history of the scaling decisions.
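For option 2, a trigger might be defined along these lines. This is a hedged sketch: it swaps the CPU example for the EMR metric `YARNMemoryAvailablePercentage`, the helper names and the 15% threshold are made up for illustration, and `cloudwatch_client` is assumed to behave like `boto3.client("cloudwatch")`. The alarm only notifies an SNS topic; whatever subscribes to that topic (a Lambda function, for example) would still have to perform the actual resize.

```python
def build_scale_up_alarm(cluster_id, sns_topic_arn, threshold=15.0):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm fires when available YARN memory stays below `threshold`
    percent for two consecutive five-minute periods.
    """
    return {
        "AlarmName": f"emr-{cluster_id}-low-yarn-memory",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "YARNMemoryAvailablePercentage",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds; matches the five-minute batches
        "EvaluationPeriods": 2,
        "Threshold": threshold,
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(cloudwatch_client, alarm):
    """Register the alarm; cloudwatch_client is assumed to be a boto3 client."""
    cloudwatch_client.put_metric_alarm(**alarm)
```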

    You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop discussed in option 1 above. Its current features include proactive as well as reactive autoscaling and a Web UI, and the tool is very easy to configure.
