DAG(directed acyclic graph) dynamic job scheduler

前端 未结 3 1676
猫巷女王i
猫巷女王i 2021-01-29 22:55

I need to manage a large workflow of ETL tasks, which execution depends on time, data availability or an external event. Some jobs may fail during execution of the workflow and

3条回答
  •  醉酒成梦
    2021-01-29 23:38

    1) You can give dagobah a try, as described on its github page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight scheduler project comparing with the three followings.

    dagobah's web interface

    2) In terms of ETL tasks, luigi which is open sourced by Spotify focus more on hadoop jobs, as described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

    luigi's web interface

    Both of the two modules are mainly written in Python and web interfaces are included for convenient management.

    As far as I know, 'luigi' doesn't provide a scheduler module for job tasks, which I think is necessary for ETL tasks. But using 'luigi' is more easy to write map-reduce code in Python and thousands of tasks every day at Spotify run depend on it.

    3) Like luigi, Pinterest open sourced their a workflow manager named Pinball. Pinball’s architecture follows a master-worker (or master-client to avoid naming confusion with a special type of client that we introduce below) paradigm where the stateful central master acts as a source of truth about the current system state to stateless clients. And it integrate hadoop/hive/spark jobs smoothly.

    pinball's web interface

    4) Airflow, yet another dag job schedule project open sourced by Airbnb, is quite like Luigi and Pinball. The backend is build on Flask, Celery and so on. According to the example job code, Airflow is both powerful and easy to use by my side.

    airflow's web interface

    Last but not least, Luigi, Airflow and Pinball may be more widely used. And there is a great comparison among these three: http://bytepawn.com/luigi-airflow-pinball.html

提交回复
热议问题