Between Apache Oozie, Spotify/Luigi and airbnb/airflow, what are the pros and cons for each of them?
I have used Oozie and Airflow in the past to build a data ingestion pipeline using Pig and Hive. Currently, I am building a pipeline that reads logs, extracts useful events, and loads them into Redshift.
I found Airflow much easier to use, test, and set up. It has a much nicer UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi, or other insights regarding stability and known issues, is welcome.
- Azkaban: Nice UI, relatively simple, accessible for non-programmers. Has a longish history at LinkedIn.
- Check out the Azkaban CLI project for programmatic job creation. I have an Azkaban example workflows project on GitHub.
- Airflow: Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird.
- Luigi: OK UI, workflows are pure Python, requires solid grasp of Python coding and object oriented concepts, hence not suitable for non-programmers.
- Oozie: Insane XML based job definitions. Here be dragons. ;-)
IMHO, Azkaban enforces simplicity (can’t use features that don’t exist) and the others subtly encourage complexity.
Simpler pipelines are better than complex pipelines: easier to create, easier to understand (especially when you didn’t create them) and easier to debug/fix.
When complex actions are needed you want to encapsulate them in a way that either completely succeeds or completely fails.
If you can make it idempotent (running it again creates identical results) then that’s even better.
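To make the last two points concrete, here is a minimal sketch of a step that either completely succeeds or completely fails, and that produces identical output when run again. It is not tied to any of the schedulers above; the function name `write_events_atomically` and the JSON-lines output format are assumptions made for illustration only. The idea is to write to a temporary file first and move it into place only after everything has been written.

```python
import json
import os
import tempfile


def write_events_atomically(events, target_path):
    """Write events as JSON lines to target_path, all or nothing.

    Hypothetical helper for illustration, not part of any scheduler's API.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            for event in events:
                # sort_keys gives a stable key order, so the same input
                # always yields byte-identical output.
                tmp_file.write(json.dumps(event, sort_keys=True) + "\n")
        # Swap the finished file into place in one step; a previous
        # output, if any, is overwritten.
        os.replace(tmp_path, target_path)
    except BaseException:
        # On any failure, remove the partial file and leave the old
        # output untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the step overwrites its target instead of appending to it, a scheduler can retry it after a failure without producing duplicate or half-written data.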
This post will give you an initial idea about different possible workflows
Source: https://stackoverflow.com/questions/35733441/suggestion-for-scheduling-tools-for-building-hadoop-based-data-pipelines