Question
Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow
Thanks in advance.
Answer 1:
Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines
- Airflow: Try this first. Decent UI, Python-ish job definitions, semi-accessible for non-programmers; the dependency-declaration syntax is a little weird.
- Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this (see the DAG sketch after this list).
- Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
- Azkaban enforces simplicity (you can't use features that don't exist), while the others subtly encourage complexity.
- Check out the Azkaban CLI project for programmatic job creation: https://github.com/mtth/azkaban (examples: https://github.com/joeharris76/azkaban_examples)
- Luigi: OK UI; workflows are pure Python and require a solid grasp of Python coding and object-oriented concepts, so it's not suitable for non-programmers.
- Oozie: Insane XML based job definitions. Here be dragons. ;-)
- Chronos: ¯\_(ツ)_/¯
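For concreteness, here's a minimal sketch of what a daily Spark job could look like as an Airflow DAG. Everything here (the dag_id, paths, and the spark-submit command) is a placeholder, not a reference implementation; it just illustrates the daily schedule, the backfill (catchup) behaviour, and the `>>` dependency syntax mentioned above.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",                    # placeholder
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# catchup=True tells the scheduler to backfill any missed daily runs,
# which is the rerun/backfill support mentioned above.
dag = DAG(
    dag_id="daily_spark_job",               # hypothetical name
    default_args=default_args,
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=True,
)

submit = BashOperator(
    task_id="spark_submit",
    # {{ ds }} is the templated execution date, so each run (including
    # backfilled ones) processes exactly one day's data.
    bash_command=(
        "spark-submit --master yarn --deploy-mode cluster "
        "--class com.example.DailyJob /path/to/app.jar {{ ds }}"
    ),
    dag=dag,
)

validate = BashOperator(
    task_id="validate_output",
    bash_command="hdfs dfs -test -e /data/daily_counts/date={{ ds }}",  # placeholder check
    dag=dag,
)

# The dependency-declaration syntax the answer calls "weird":
submit >> validate
```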
Philosophy:
Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn't create them), and easier to debug/fix.
When complex actions are needed, you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make them idempotent (running the action again creates identical results), that's even better.
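As a sketch of that idempotence idea in PySpark (all paths and column names are made up): the job reads exactly one date partition and overwrites exactly one output partition, so rerunning or backfilling it for the same date produces identical results rather than appending duplicates.

```python
import sys

from pyspark.sql import SparkSession

def run_daily_job(run_date):
    """Process one day's data; rerunning for the same date yields identical output."""
    spark = SparkSession.builder.appName("daily-job").getOrCreate()

    # Read only the partition for the given run date (path is a placeholder).
    df = spark.read.parquet("/data/events/date=%s" % run_date)

    result = df.groupBy("user_id").count()

    # mode("overwrite") replaces the target partition instead of appending,
    # so a rerun/backfill for the same date produces the same result --
    # the idempotence property described above.
    result.write.mode("overwrite").parquet("/data/daily_counts/date=%s" % run_date)

    spark.stop()

if __name__ == "__main__":
    run_daily_job(sys.argv[1])   # e.g. 2017-01-24, passed in by the scheduler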
Source: https://stackoverflow.com/questions/41831708/scheduling-spark-jobs-on-a-timely-basis