Question
I have a DAG which
- downloads a CSV file from cloud storage
- uploads the CSV file to a 3rd party via HTTPS
The Airflow cluster I am executing on uses the CeleryExecutor by default, so I'm worried that at some point, when I scale up the number of workers, these tasks may be executed on different workers: e.g. worker A does the download, then worker B tries the upload but doesn't find the file (because it lives on worker A).
Is it possible to somehow guarantee that both the download and upload operators will be executed on the same Airflow worker?
Answer 1:
For these kinds of use cases we have two solutions:
- Use a network-mounted drive that is shared between the workers, so that both the download and the upload task have access to the same file system.
- Use an Airflow queue that is worker-specific. If only one worker is listening on this queue, you guarantee that both tasks will have access to the same file system. Note that each worker can listen on multiple queues, so you can have it consuming the "default" queue as well as the custom one intended for these tasks (see the sketch after this list).
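A minimal sketch of the queue approach, assuming a hypothetical queue name csv_worker, DAG id, and stubbed callables (the queue parameter on operators and the airflow worker -q flag are standard Airflow features). Pin both tasks to the dedicated queue and start exactly one worker consuming it:

from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

LOCAL_PATH = '/tmp/data.csv'  # hypothetical scratch path on the worker's disk

def download_csv():
    # Hypothetical stand-in for the real cloud-storage download to LOCAL_PATH.
    pass

def upload_csv():
    # Hypothetical stand-in for the real HTTPS upload; reads LOCAL_PATH.
    pass

dag = DAG('csv_transfer', schedule_interval=None, start_date=datetime(2017, 1, 1))

# Both tasks are pinned to the same dedicated queue. Start exactly one worker
# consuming it, e.g.:  airflow worker -q default,csv_worker
# Because a single worker serves 'csv_worker', both tasks see the same disk.
task_download = PythonOperator(
    task_id='task_download_csv',
    python_callable=download_csv,
    queue='csv_worker',
    dag=dag,
)
task_upload = PythonOperator(
    task_id='task_upload_csv',
    python_callable=upload_csv,
    queue='csv_worker',
    dag=dag,
)
task_download >> task_upload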
Answer 2:
Put step 1 (the CSV download) and step 2 (the CSV upload) into a SubDAG, then trigger it via the SubDagOperator with its executor option set to a SequentialExecutor; this ensures that steps 1 and 2 run on the same worker.
Here is a working DAG file illustrating that concept (with the actual operations stubbed out as DummyOperators), with the download/upload steps in the context of some larger process:
from datetime import datetime
from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.executors.sequential_executor import SequentialExecutor

PARENT_DAG_NAME = 'subdaggy'
CHILD_DAG_NAME = 'subby'

def make_sub_dag(parent_dag_name, child_dag_name, start_date, schedule_interval):
    # The sub-DAG's id must be '<parent>.<child>' for the SubDagOperator to find it.
    dag = DAG(
        '%s.%s' % (parent_dag_name, child_dag_name),
        schedule_interval=schedule_interval,
        start_date=start_date,
    )
    task_download = DummyOperator(
        task_id='task_download_csv',
        dag=dag,
    )
    task_upload = DummyOperator(
        task_id='task_upload_csv',
        dag=dag,
    )
    task_download >> task_upload
    return dag

main_dag = DAG(
    PARENT_DAG_NAME,
    schedule_interval=None,
    start_date=datetime(2017, 1, 1),
)

main_task_1 = DummyOperator(
    task_id='main_1',
    dag=main_dag,
)

# Running the sub-DAG under a SequentialExecutor keeps its tasks on the
# worker that executes the SubDagOperator itself.
main_task_2 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=make_sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, main_dag.start_date, main_dag.schedule_interval),
    executor=SequentialExecutor(),
    dag=main_dag,
)

main_task_3 = DummyOperator(
    task_id='main_3',
    dag=main_dag,
)

main_task_1 >> main_task_2 >> main_task_3
Source: https://stackoverflow.com/questions/45842564/guarantee-that-some-operators-will-be-executed-on-the-same-airflow-worker