Conditionally execute multiple branches one by one

Submitted by 痞子三分冷 on 2021-01-29 20:10:22

Question


Note

  • Please read and understand the question thoroughly
  • It cannot be solved by simple BranchPythonOperator / ShortCircuitOperator

We have an unusual multiplexer-like use-case in our workflow

                                +-----------------------+
                                |                       |
                  +------------>+  branch-1.begin-task  |
                  |             |                       |
                  |             +-----------------------+
                  |
                  |
                  |             +-----------------------+
                  |             |                       |
                  +------------>+  branch-2.begin-task  |
                  |             |                       |
+------------+    |             +-----------------------+
|            |    |
|  MUX-task  +----+                         +
|            |    |                         |
+------------+    |
                  |                         |
                  +- -- -- -- ->
                  |                         |
                  |
                  |                         |
                  |                         +
                  |
                  |             +-----------------------+
                  |             |                       |
                  +------------>+  branch-n.begin-task  |
                                |                       |
                                +-----------------------+

The flow is expected to work as follows

  • MUX-task listens for events on an external queue (single queue)
  • each event on queue triggers execution of one of the branches (branch-n.begin-task)
  • one-by-one, as events arrive, the MUX-task must trigger execution of respective branch
  • once all branches have been triggered, the MUX-task completes

Assumptions

  • Exactly n events arrive on queue, one for triggering each branch
  • n is known only at runtime: its value is defined in an Airflow Variable

Limitations

  • The external queue where events arrive is only one
  • we can't have n queues (one per branch) since branches grow with time (n is dynamically defined)

We are not able to come up with a solution within Airflow's set of operators and sensors (or anything else available out of the box in Airflow) to build this

  1. Sensors can be used for listening events on external queue; but we have to listen for multiple events, not one
  2. BranchPythonOperator can be used to trigger execution of a single branch out of many, but it immediately marks remaining branches as skipped

Primary bottleneck

Because of the 2nd limitation above, even a custom-operator combining functionality of a Sensor and BranchPythonOperator won't work.

We have tried to brainstorm around a fancy combination of Sensors, DummyOperator and trigger_rules to achieve this, but have had no success thus far.

Is this doable in Airflow?


UPDATE-1

Here's some background info to understand the context of workflow

  • we have an ETL pipeline to sync MySQL tables (across multiple Aurora databases) to our data-lake
  • to overcome the impact of our sync pipeline on production databases, we have decided to do this
    • for each database, create a snapshot (restore AuroraDB cluster from last backup)
    • run MySQL sync pipeline using that snapshot
    • at the end of sync, terminate the snapshot (AuroraDB cluster)
  • the snapshot lifecycle events of Aurora snapshot restore process are published to an SQS queue
    • single queue for all databases
    • this setup was done by our DevOps team (different AWS account, we don't have access to the underlying Lambdas / SQS / infra)

Answer 1:


XCOMs to the rescue!


We decided to model the tasks as follows (both tasks are custom operators)

  • The MUX-task is more like an iterative-sensor: it keeps listening for events on queue and takes some action against each event arriving on queue
  • All branch-x.begin-tasks are simple sensors: they wait for an XCOM to be published (whose name follows a pre-defined format)

The workflow runs as follows

  • The MUX-task listens for events on queue (listening part is enclosed in a for-loop with as many iterations as the number of branches)
  • When an event arrives, the MUX-task picks it up; it identifies which 'branch' should be triggered and publishes an XCOM for the respective branch
  • The respective branch's sensor picks up that XCOM on its next poke and the branch starts executing. In effect, the branch's sensor merely acts as a gateway that opens on an external event (the XCOM) and allows the branch to execute
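
The steps above can be sketched in plain Python, with the queue and the XCOM store passed in as callables so the logic can be tried without an Airflow installation. This is only an illustration under stated assumptions, not our actual operators: the key format returned by branch_xcom_key, the "branch_id" event field, and the function names are all hypothetical. It also tracks a set of pending branches rather than counting iterations in a for-loop, so a duplicate event would simply be ignored.

```python
from typing import Callable, Iterable, Optional


def branch_xcom_key(branch_id: str) -> str:
    # Hypothetical key format; the real one is whatever MUX-task and the
    # branch sensors agree on.
    return f"trigger__{branch_id}"


def run_mux(
    receive_event: Callable[[], Optional[dict]],
    push_xcom: Callable[[str, object], None],
    branch_ids: Iterable[str],
) -> None:
    """Logic for the MUX-task's execute(): publish one XCOM per event.

    In a custom operator, push_xcom could wrap
    context["ti"].xcom_push(key=..., value=...).
    """
    pending = set(branch_ids)
    while pending:
        event = receive_event()
        if event is None:
            # Nothing on the queue yet. A real implementation would block
            # or sleep here (for example via SQS long polling).
            continue
        branch_id = event["branch_id"]  # assumed event field
        if branch_id in pending:
            push_xcom(branch_xcom_key(branch_id), event)
            pending.discard(branch_id)


def gate_is_open(pull_xcom: Callable[[str], object], branch_id: str) -> bool:
    """Logic for a branch sensor's poke(): True once the XCOM exists.

    In a custom sensor, pull_xcom could wrap
    context["ti"].xcom_pull(task_ids="MUX-task", key=...).
    """
    return pull_xcom(branch_xcom_key(branch_id)) is not None
```

With a dict standing in for the XCOM store, run_mux(next_event, store.__setitem__, ["branch-1", "branch-2"]) fills the store as events arrive, and gate_is_open(store.get, "branch-1") starts returning True once the matching event has been processed.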

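Since there are many sensors (one per branch), we would most likely employ mode='reschedule' to avoid deadlocks: in the default poke mode every waiting sensor holds a worker slot for its entire wait, whereas in reschedule mode the slot is released between pokes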


  • Since the described approach relies heavily on polling, we don't deem it to be super efficient.
  • A reactive triggering based approach would be more desirable, but we haven't been able to work it out

UPDATE-1

  • It looks like a 'reactive' approach is achievable if we model each branch as a separate DAG and, instead of publishing an XCOM for each branch, trigger the branch's DAG just like TriggerDagRunOperator does
  • But since our monolithic DAG is generated programmatically via complex logic, this change would have been quite hard (lots of code rewrite). So we decided to continue with the poll-based approach and live with a few minutes of extra delay in a pipeline that already takes several hours to complete

UPDATE-2

[with reference to UPDATE-1 section of question]

Since our actual implementation required us to just wait for creation of database, we decided to simplify the workflow as follows

  • database endpoints were fixed via DNS (they didn't change every time Aurora snapshot was restored)
  • we did away with the MUX-task (and so also the SQS queue for Aurora restore lifecycle events)
  • each branch's begin-task (branch-x.begin-task) was modelled as a simple sensor that fires a dummy SQL query (SELECT 1) to check whether the database endpoint has become active
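
The SELECT 1 check in the last point could look roughly like the sketch below. It is a hedged illustration rather than our production sensor: database_is_ready and its connect argument are hypothetical names, and the function assumes only that connect returns a standard Python DB-API connection (which MySQL drivers provide). A sensor's poke() would return the result of this function.

```python
from typing import Any, Callable


def database_is_ready(connect: Callable[[], Any]) -> bool:
    """Return True if a trivial query succeeds against the endpoint.

    Any failure to connect or to run the query is treated as
    "not ready yet", so the sensor simply tries again on its next poke.
    """
    try:
        conn = connect()
    except Exception:
        # Endpoint not reachable yet (e.g. cluster still restoring).
        return False
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT 1")
        return cursor.fetchone() is not None
    except Exception:
        return False
    finally:
        conn.close()
```

To try it locally without a MySQL server, the standard-library sqlite3 module can stand in as the driver: database_is_ready(lambda: sqlite3.connect(":memory:")) succeeds, while a connect callable that raises yields False.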


Source: https://stackoverflow.com/questions/61304429/conditionally-execute-multiple-branches-one-by-one
