Scheduling spark jobs on a timely basis

筅森魡賤 提交于 2019-12-04 18:22:43
Joe Harris

Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines

  • Airflow: Try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
    • Airflow has built in support for the fact that jobs scheduled jobs often need to be rerun and/or backfilled. Make sure you build your pipelines to support this.
  • Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
  • Luigi: OK UI, workflows are pure Python, requires solid grasp of Python coding and object oriented concepts, hence not suitable for non-programmers.
  • Oozie: Insane XML based job definitions. Here be dragons. ;-)
  • Chronos: ¯\_(ツ)_/¯

Philosophy:

Simpler pipelines are better than complex pipelines: Easier to create, easier to understand (especially when you didn’t create) and easier to debug/fix.

When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.

If you can make it idempotent (running it again creates identical results) then that’s even better.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!