Question
I use EMR to create new instances, process jobs, and then shut the instances down.
I need to schedule jobs periodically. One easy implementation would be to use Quartz to trigger the EMR jobs, but in the long run I would prefer an out-of-the-box MapReduce scheduling solution. Is there any out-of-the-box scheduling feature in EMR or the AWS SDK that I can use for this? I can see that Auto Scaling supports scheduling, but I want to schedule an EMR job flow instead.
Answer 1:
There is Apache Oozie Workflow Scheduler for Hadoop to do just that.
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
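To make the Coordinator idea concrete, here is a minimal sketch that renders a time-triggered coordinator definition and checks that it is well-formed XML. The function name `render_coordinator`, the job name, the workflow path, and the dates are all hypothetical; the `0.4` schema version and the daily frequency are assumptions, so check them against the Oozie version you actually deploy.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Skeleton of a time-triggered Oozie coordinator (no data dependencies).
COORDINATOR_TEMPLATE = """<coordinator-app name="{name}" frequency="{frequency}"
                 start="{start}" end="{end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <app-path>{workflow_path}</app-path>
    </workflow>
  </action>
</coordinator-app>
"""


def render_coordinator(name, workflow_path, start, end,
                       frequency="${coord:days(1)}"):
    """Return coordinator XML that runs the given workflow once per period."""
    quote = {'"': "&quot;"}  # also escape quotes, since most values are attributes
    xml_text = COORDINATOR_TEMPLATE.format(
        name=escape(name, quote),
        frequency=escape(frequency, quote),
        start=escape(start, quote),
        end=escape(end, quote),
        workflow_path=escape(workflow_path),
    )
    ET.fromstring(xml_text)  # raises ParseError if the result is malformed
    return xml_text
```

The rendered text would be saved as `coordinator.xml` next to the workflow definition and submitted with the Oozie client.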
Here is a simple example of Elastic MapReduce bootstrap actions for configuring Apache Oozie: https://github.com/lila/emr-oozie-sample
Be aware that Oozie is somewhat complicated. It is worth adopting only if you have many jobs to schedule, monitor, and maintain; if you have just two or three jobs to run periodically, a handful of cron jobs will do.
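For the cron route, a small script that launches a transient cluster is usually enough. The sketch below assumes boto3 is installed and AWS credentials are configured; the helper names, instance type, release label, region, and the default role names are illustrative assumptions, not anything prescribed by EMR for your account.

```python
def build_job_flow_request(name, jar, args,
                           instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; pick your own
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Shut the cluster down once the steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": name + "-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": jar, "Args": list(args)},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def submit(request, region="us-east-1"):
    """Start the job flow and return its ID (requires boto3 and credentials)."""
    import boto3  # imported here so the request builder works without it
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]

# Hypothetical crontab entry that runs a script calling submit() daily at 02:00:
# 0 2 * * * /usr/bin/python3 /opt/jobs/run_emr_job.py
```

A script invoked by cron would call `submit(build_job_flow_request(...))` with the location of your own job JAR and its arguments; the cluster terminates by itself when the step completes, which matches the create, process, shut down pattern described in the question.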
You may also want to explore Amazon Simple Workflow Service (SWF).
Source: https://stackoverflow.com/questions/14014486/tool-ways-to-schedule-amazons-elastic-mapreduce-jobs