Question
Note
- Please read and understand the question thoroughly
- It cannot be solved by a simple `BranchPythonOperator` / `ShortCircuitOperator`
We have an unusual multiplexer-like use-case in our workflow:
+------------+
|            |        +-----------------------+
|  MUX-task  +---+--->| branch-1.begin-task   |
|            |   |    +-----------------------+
+------------+   |
                 |    +-----------------------+
                 +--->| branch-2.begin-task   |
                 |    +-----------------------+
                 |
                 |              ...
                 |
                 |    +-----------------------+
                 +--->| branch-n.begin-task   |
                      +-----------------------+
The flow is expected to work as follows
- the `MUX-task` listens for events on an external queue (a single queue)
- each event on the queue triggers execution of one of the branches (`branch-n.begin-task`)
- one-by-one, as events arrive, the `MUX-task` must trigger execution of the respective branch
- once all branches have been triggered, the `MUX-task` completes
Assumptions
- Exactly `n` events arrive on the queue, one for triggering each branch
- `n` is dynamically known: its value is defined in a `Variable`
Limitations
- There is only one external queue where events arrive
- we can't have `n` queues (one per branch) since branches grow over time (`n` is dynamically defined)
We have not been able to come up with a solution within Airflow's set of operators and sensors (or anything available out-of-the-box in Airflow) to build this
- `Sensor`s can be used for listening for events on an external queue; but we have to listen for multiple events, not one
- `BranchPythonOperator` can be used to trigger execution of a single branch out of many, but it immediately marks the remaining branches as skipped
Primary bottleneck
Because of the 2nd limitation above, even a custom operator combining the functionality of a `Sensor` and a `BranchPythonOperator` won't work.
We have tried to brainstorm fancy combinations of `Sensor`s, `DummyOperator`s and `trigger_rule`s to achieve this, but have had no success thus far.
Is this doable in Airflow?
UPDATE-1
Here's some background info to understand the context of the workflow
- we have an ETL pipeline to sync `MySQL` tables (across multiple `Aurora` databases) to our data-lake
- to limit the impact of our sync pipeline on the production databases, we have decided to do this:
  - for each database, create a snapshot (restore an `Aurora` DB cluster from the last backup)
  - run the `MySQL` sync pipeline using that snapshot
  - at the end of the sync, terminate the snapshot (the `Aurora` DB cluster)
- the snapshot lifecycle events of the `Aurora` snapshot-restore process are published to an `SQS` queue: a single queue for all databases
- this setup was done by our DevOps team (different AWS account; we don't have access to the underlying `Lambda`s / `SQS` / infra)
Answer 1:
XCOMs to the rescue!
We decided to model the tasks as follows (both tasks are custom operators)
- The `MUX-task` is more like an iterative sensor: it keeps listening for events on the queue and takes some action against each event arriving on the queue
- All `branch-x.begin-task`s are simple sensors: they listen for the publishing of an `XCOM` (whose name is in a pre-defined specific format)

The workflow runs as follows
- The `MUX-task` listens for events on the queue (the listening part is enclosed in a `for`-loop with as many iterations as the number of branches)
- When an event arrives, the `MUX-task` picks it up; it identifies which 'branch' should be triggered and publishes an `XCOM` for the respective branch
- The respective branch's sensor picks up that `XCOM` on its next poke and the branch starts executing. In effect, the branch's sensor merely acts as a gateway that opens with an external event (the `XCOM`) and allows execution of the branch
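The MUX loop above can be sketched in plain Python with no Airflow dependency: a `queue.Queue` stands in for the external queue and a dict stands in for `XCOM` publication. `run_mux`, the event shape, and the `begin_branch_<id>` key format are all illustrative assumptions; in the real custom operator this body would live in `execute()` with `ti.xcom_push` replacing the dict write.

```python
# Plain-Python sketch of the MUX-task loop: consume exactly n events
# from a single queue and publish one "XCOM" per branch.
import queue

def run_mux(events_q, n, poll_timeout=1.0):
    """Consume n events; return {xcom_key: payload}, one entry per branch."""
    published = {}
    for _ in range(n):                                  # one iteration per branch
        event = events_q.get(timeout=poll_timeout)      # blocking poll of the queue
        branch_id = event["branch"]                     # identify the target branch
        # key format the branch sensors poke for (assumed naming convention)
        published[f"begin_branch_{branch_id}"] = event
    return published

q = queue.Queue()
for b in (2, 1, 3):                                     # events arrive in any order
    q.put({"branch": b, "payload": f"event-for-{b}"})
xcoms = run_mux(q, n=3)                                 # one XCOM per branch published
```

The `for`-loop over `n` is what makes the task complete only after every branch has been triggered, matching the "once all branches have been triggered, the MUX-task completes" requirement.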
Since there are so many sensors (one per branch), we would most likely employ `mode='reschedule'` to overcome deadlocks
- Since the described approach relies heavily on polling, we don't deem it to be super efficient.
- A reactive triggering based approach would be more desirable, but we haven't been able to work it out
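The gateway behaviour of each branch sensor reduces to a single membership check per poke. A minimal stand-in (the `xcom_pull` side is modelled as a dict lookup, and the `begin_branch_<id>` key format is an assumption carried over from the MUX side) might look like:

```python
# Stand-in for the branch sensor's poke(): in the real custom sensor this
# would call ti.xcom_pull(task_ids='MUX-task', key=...) instead of reading
# a dict. With mode='reschedule', the worker slot is freed between pokes,
# so n waiting sensors cannot deadlock the executor pool.
def branch_gate_poke(published_xcoms, branch_id):
    """Return True (open the gate, let the branch run) once the MUX-task
    has published the XCOM this branch is waiting for."""
    return f"begin_branch_{branch_id}" in published_xcoms

store = {}                                        # nothing published yet
gate_before = branch_gate_poke(store, 1)          # False: sensor keeps poking
store["begin_branch_1"] = {"payload": "event-for-1"}
gate_after = branch_gate_poke(store, 1)           # True: branch 1 may start
gate_other = branch_gate_poke(store, 2)           # False: branch 2 still waits
```

Each poke is cheap, but the interval between pokes is exactly the polling latency mentioned above.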
UPDATE-1
- Looks like the 'reactive' approach is achievable if we model each branch as a separate `DAG` and, instead of publishing `XCOM`s for each branch, trigger the branch's `DAG` just like `TriggerDagRunOperator` does
- But since our monolithic `DAG` is generated programmatically via complex logic, this change would have been quite hard (lots of code rewrite). So we decided to continue with the poll-based approach and live with a few minutes of extra delay in a pipeline that already takes several hours to complete
UPDATE-2
[with reference to UPDATE-1 section of question]
Since our actual implementation only required us to wait for the creation of a database, we decided to simplify the workflow as follows
- database endpoints were fixed via `DNS` (they didn't change every time an `Aurora` snapshot was restored)
- we did away with the `MUX-task` (and with it, the `SQS` queue for `Aurora` restore lifecycle events)
- each branch's begin-task `branch-x.begin-task` was modelled as a simple sensor that fires a dummy SQL query (`SELECT 1`) to check whether the database endpoint has become active
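The readiness check inside each simplified begin-task sensor can be sketched as follows. `endpoint_is_active` and the injected `connect_fn` are illustrative assumptions: `connect_fn` is any DB-API-style connection factory (e.g. a MySQL driver's `connect` bound to the branch's DNS endpoint), so the same poke logic works against any driver.

```python
# Sketch of the "is the database endpoint up yet?" check: fire a dummy
# SELECT 1 and treat any connection or query failure as "not up yet",
# which makes the sensor poke again on its next interval.
def endpoint_is_active(connect_fn):
    """Return True once a trivial query succeeds against the endpoint."""
    try:
        conn = connect_fn()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")              # dummy query: success == endpoint up
            return cur.fetchone() is not None
        finally:
            conn.close()
        # any exception (DNS failure, refused connection, auth error while
    except Exception:
        return False                             # booting, ...) -> keep waiting
```

In the real sensor, `poke()` would simply return `endpoint_is_active(...)`; everything else (interval, timeout, reschedule mode) is handled by the sensor base class.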
Source: https://stackoverflow.com/questions/61304429/conditionally-execute-multiple-branches-one-by-one