Question
Note
- Please read and understand the question thoroughly before answering
- It cannot be solved by a simple BranchPythonOperator / ShortCircuitOperator

We have an unusual multiplexer-like use-case in our workflow:
                    +-----------------------+
     +------------->|  branch-1.begin-task  |
     |              +-----------------------+
     |
     |              +-----------------------+
     +------------->|  branch-2.begin-task  |
     |              +-----------------------+
+----+-------+
|  MUX-task  |               ...
+----+-------+
     |              +-----------------------+
     +------------->|  branch-n.begin-task  |
                    +-----------------------+
The flow is expected to work as follows:

MUX-task
- listens for events on an external queue (a single queue); each event on the queue triggers execution of one of the branches (branch-n.begin-task)
- one-by-one, as events arrive, the MUX-task must trigger execution of the respective branch
- once all branches have been triggered, the MUX-task completes
Assumptions
- Exactly n events arrive on the queue, one for triggering each branch
- n is dynamically known: its value is defined in a Variable
Limitations
- The external queue where events arrive is only one; we can't have n queues (one per branch) since branches grow with time (n is dynamically defined)
We are not able to come up with a solution within Airflow's set of operators and sensors (or anything available out-of-the-box in Airflow) to build this:
- Sensors can be used for listening for events on an external queue, but we have to listen for multiple events, not one
- BranchPythonOperator can be used to trigger execution of a single branch out of many, but it immediately marks the remaining branches as skipped
Primary bottleneck
Because of the 2nd limitation above, even a custom operator combining the functionality of a Sensor and a BranchPythonOperator won't work.
We have tried to brainstorm around fancy combinations of Sensors, DummyOperators and trigger_rules to achieve this, but have had no success thus far.
Is this doable in Airflow?
UPDATE-1
Here's some background info to understand the context of the workflow:
- we have an ETL pipeline to sync MySQL tables (across multiple Aurora databases) to our data lake
- to overcome the impact of our sync pipeline on production databases, we have decided to do this:
  - for each database, create a snapshot (restore an Aurora DB cluster from the last backup)
  - run the MySQL sync pipeline using that snapshot
  - at the end of the sync, terminate the snapshot (Aurora DB cluster)
- the snapshot lifecycle events of the Aurora snapshot-restore process are published to an SQS queue (a single queue for all databases)
- this setup was done by our DevOps team (different AWS account; we don't have access to the underlying Lambdas / SQS / infra)
Answer 1:
XComs to the rescue!
We decided to model the tasks as follows (both tasks are custom operators):
- The MUX-task is more like an iterative sensor: it keeps listening for events on the queue and takes some action against each event arriving on the queue
- All branch-x.begin-tasks are simple sensors: they listen for the publishing of an XCom (whose name follows a pre-defined, specific format)
The workflow runs as follows:
- The MUX-task listens for events on the queue (the listening part is enclosed in a for-loop with as many iterations as the number of branches)
- When an event arrives, the MUX-task picks it up; it identifies which branch should be triggered and publishes an XCom for the respective branch
- The respective branch's sensor picks up that XCom on its next poke and the branch starts executing. In effect, the branch's sensor merely acts as a gateway that opens up with an external event (the XCom) and allows execution of the branch
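The XCom handshake just described can be sketched outside Airflow, with a dict standing in for the XCom table and a list standing in for the SQS queue. All names below (`mux_task`, `branch_sensor_poke`, the `trigger.*` key format) are illustrative stand-ins, not Airflow APIs:

```python
# Plain-Python sketch of the MUX + sensor handshake described above.
# A dict stands in for Airflow's XCom table and a list for the SQS queue;
# every name here is illustrative, not an actual Airflow API.

N_BRANCHES = 3                 # in the real DAG this would come from a Variable
xcom_table = {}                # stand-in for XCom storage

def mux_task(queue):
    """Iterative sensor: consume exactly N_BRANCHES events, one per branch."""
    for _ in range(N_BRANCHES):
        event = queue.pop(0)   # in reality: keep polling SQS until a message arrives
        branch = event["branch"]
        # publish an XCom whose key follows the pre-defined format
        xcom_table["trigger.{}".format(branch)] = True

def branch_sensor_poke(branch):
    """Each branch-x.begin-task merely pokes for its own XCom key."""
    return xcom_table.get("trigger.{}".format(branch), False)

# three events arrive on the single queue, in arbitrary order
queue = [{"branch": "branch-2"}, {"branch": "branch-1"}, {"branch": "branch-3"}]
mux_task(queue)
assert all(branch_sensor_poke(b) for b in ("branch-1", "branch-2", "branch-3"))
```

In the real DAG, `branch_sensor_poke` would be the `poke()` of a custom sensor pulling the XCom, and the loop inside `mux_task` would block on the queue instead of popping a list.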
- Since there are so many sensors (one per branch), we would most likely employ mode='reschedule' to overcome deadlocks
- Since the described approach relies heavily on polling, we don't deem it to be super efficient
- A reactive, trigger-based approach would be more desirable, but we haven't been able to work it out
UPDATE-1
- Looks like the 'reactive' approach is achievable if we could model each branch as a separate DAG and, instead of publishing XComs for each branch, trigger the branch's DAG just like TriggerDagRunOperator does
- But since our monolithic DAG is generated programmatically via complex logic, this change would have been quite hard (lots of code rewrite). So we decided to continue with the poll-based approach and live with a few minutes of extra delay in a pipeline that already takes several hours to complete
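The reactive variant can be illustrated with a similar plain-Python stand-in: the MUX fires each branch the moment its event arrives, the way TriggerDagRunOperator fires a child DAG run. `trigger_dag_run` below is a hypothetical recorder, not the Airflow API:

```python
# Plain-Python sketch of the 'reactive' variant: instead of branches polling
# for an XCom, the MUX triggers each branch DAG the moment its event arrives.
# trigger_dag_run() is a hypothetical stand-in, not the Airflow API.

triggered_runs = []

def trigger_dag_run(dag_id, conf=None):
    # in Airflow this would be TriggerDagRunOperator / the DAG-run API
    triggered_runs.append((dag_id, conf or {}))

def mux_task(queue, n_branches):
    for _ in range(n_branches):
        event = queue.pop(0)   # poll the single SQS queue
        # reactive: trigger the branch DAG immediately; no per-branch sensor
        trigger_dag_run("branch_{}_dag".format(event["branch"]), conf=event)

queue = [{"branch": 2}, {"branch": 1}]
mux_task(queue, n_branches=2)
```

The trade-off is exactly the one described above: no polling delay, but each branch must exist as its own DAG.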
UPDATE-2
[with reference to UPDATE-1 section of question]
Since our actual implementation required us to just wait for the creation of a database, we decided to simplify the workflow as follows:
- database endpoints were fixed via DNS (they didn't change every time an Aurora snapshot was restored)
- we did away with the MUX-task (and with it, the SQS queue for Aurora restore lifecycle events)
- each branch's begin-task (branch-x.begin-task) was modelled as a simple sensor that fired a dummy SQL query (SELECT 1) to check whether the database endpoint had become active or not
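The simplified begin-task boils down to a poke function that attempts the dummy query and succeeds once the endpoint answers. A minimal sketch, using sqlite3 purely as a stand-in for the MySQL/Aurora endpoint (the real sensor would hit the fixed DNS endpoint, likely with mode='reschedule'):

```python
# Sketch of the simplified begin-task sensor: its poke fires a dummy
# "SELECT 1" and succeeds once the endpoint answers. sqlite3 is used here
# purely as a stand-in for the MySQL/Aurora endpoint.
import sqlite3

def endpoint_is_active(connect):
    """Return True iff a connection can be made and 'SELECT 1' succeeds."""
    try:
        conn = connect()
        try:
            return conn.execute("SELECT 1").fetchone()[0] == 1
        finally:
            conn.close()
    except Exception:
        return False   # not reachable yet -> sensor keeps poking

def broken_connect():
    # plays the role of an endpoint whose snapshot is still being restored
    raise OSError("endpoint not reachable yet")

assert endpoint_is_active(lambda: sqlite3.connect(":memory:")) is True
assert endpoint_is_active(broken_connect) is False
```

Returning False on any connection error is what lets the sensor treat "database still coming up" and "query failed" uniformly: it simply pokes again on the next interval.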
Source: https://stackoverflow.com/questions/61304429/conditionally-execute-multiple-branches-one-by-one