Suggestion for scheduling tool(s) for building Hadoop-based data pipelines

Submitted by 允我心安 on 2019-12-06 12:44:05
  • Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
  • Airflow: Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
  • Luigi: OK UI, workflows are pure Python, requires a solid grasp of Python coding and object-oriented concepts, hence not suitable for non-programmers.
  • Oozie: Insane XML-based job definitions. Here be dragons. ;-)

IMHO, Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.

Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn’t create them) and easier to debug/fix.

When complex actions are needed, you want to encapsulate them in a way that either completely succeeds or completely fails.

If you can make it idempotent (running it again creates identical results) then that’s even better.
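As a rough illustration of the two points above, here is a minimal Python sketch of a step that either fully succeeds or leaves nothing behind, and that gives the same result when rerun. It assumes a local filesystem as a stand-in for wherever the pipeline really writes; the names `write_atomically`, `run_daily_step` and the `counts-<day>.txt` layout are made up for this example and do not come from any of the tools listed.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Write lines to path so that path is either fully written or untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the partial file; the target is never touched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_daily_step(out_dir, day, records):
    """Hypothetical pipeline step whose output depends only on its inputs."""
    out_path = os.path.join(out_dir, f"counts-{day}.txt")
    # Sorting makes the output independent of input order, and the
    # date-based file name means a rerun overwrites instead of appending.
    write_atomically(out_path, sorted(records))
    return out_path
```

Calling `run_daily_step(out_dir, "2019-12-06", records)` twice with the same records leaves one file with the same contents, so a scheduler can retry the step safely. The same write-to-temp-then-rename idea is commonly used on HDFS as well, though the details there depend on the client you use.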

This post will give you an initial idea of the different workflow tools available:

http://bytepawn.com/luigi-airflow-pinball.html
