Scheduling Spark jobs on a timely basis

Posted by 爷,独闯天下 on 2019-12-14 03:43:05

Question


Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis? 1) Oozie 2) Luigi 3) Azkaban 4) Chronos 5) Airflow

Thanks in advance.


Answer 1:


Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines

  • Airflow: Try this first. Decent UI, Python-ish job definitions, semi-accessible for non-programmers; the dependency-declaration syntax is a bit odd.
    • Airflow has built-in support for the fact that scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this.
  • Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
    • Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
    • Check out the Azkaban CLI project for programmatic job creation. https://github.com/mtth/azkaban (examples https://github.com/joeharris76/azkaban_examples)
  • Luigi: OK UI, workflows are pure Python, requires a solid grasp of Python and object-oriented concepts, so it's not suitable for non-programmers.
  • Oozie: Insane XML based job definitions. Here be dragons. ;-)
  • Chronos: ¯\_(ツ)_/¯
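
To make the rerun/backfill point concrete, here is a scheduler-agnostic sketch (not Airflow's actual API) of what "backfill" means: walk every scheduled interval since the pipeline's start date and enqueue any daily run that has not completed. The `missing_runs` helper is hypothetical, purely for illustration.

```python
from datetime import date, timedelta

def missing_runs(start: date, today: date, completed: set) -> list:
    """Return the daily run dates that still need to execute.

    A scheduler with backfill support does essentially this: for each
    day since the pipeline's start date, check whether that day's run
    has completed, and schedule it if not.
    """
    runs = []
    d = start
    while d < today:  # yesterday's interval is the latest complete one
        if d not in completed:
            runs.append(d)
        d += timedelta(days=1)
    return runs

# Example: pipeline started Jan 1; Jan 1 and Jan 3 succeeded, Jan 2 is missing.
done = {date(2017, 1, 1), date(2017, 1, 3)}
print(missing_runs(date(2017, 1, 1), date(2017, 1, 4), done))
# → [datetime.date(2017, 1, 2)]
```

A pipeline "built to support this" is one where rerunning any of those missed dates is safe, which is exactly the idempotency point made below.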

Philosophy:

Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn't create them), and easier to debug/fix.

When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.

If you can make it idempotent (running it again produces identical results), that's even better.
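
One common way to get both properties (all-or-nothing and idempotent) is to write each run's output to a path keyed by the run date, staging to a temp file and atomically renaming it into place. In Spark itself the analogous pattern is overwriting a date-keyed partition; below is a local-filesystem sketch of the same idea, with a hypothetical `write_partition` helper.

```python
import os
import tempfile

def write_partition(output_dir: str, run_date: str, rows: list) -> str:
    """Write one day's output so reruns overwrite rather than append.

    * All-or-nothing: write to a temp file, then atomically rename,
      so a crash mid-write never leaves a half-written partition.
    * Idempotent: the path is keyed by run_date, so rerunning the job
      for the same day replaces the partition with identical results.
    """
    os.makedirs(output_dir, exist_ok=True)
    final_path = os.path.join(output_dir, "dt=%s.csv" % run_date)
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    try:
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(rows))
        os.replace(tmp_path, final_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise
    return final_path
```

Rerunning `write_partition` for the same `run_date` (say, during a backfill) leaves exactly one file with the same contents, instead of duplicated or partial output.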



来源:https://stackoverflow.com/questions/41831708/scheduling-spark-jobs-on-a-timely-basis
