Airflow : Passing a dynamic value to Sub DAG operator

Submitted by 你说的曾经没有我的故事 on 2020-01-12 08:08:07

Question


I am new to Airflow.
I have come across a scenario where the parent DAG needs to pass a dynamic number (let's say n) to a SubDAG.
The SubDAG will then use this number to dynamically create n parallel tasks.

The Airflow documentation doesn't cover a way to achieve this, so I have explored a couple of approaches:

Option - 1 (Using XCom pull)

I have tried to pass it as an XCom value, but for some reason the SubDAG is not resolving the passed value.

Parent DAG File

def load_dag(**kwargs):
    number_of_runs = json.dumps(kwargs['dag_run'].conf['number_of_runs'])
    dag_data = json.dumps({
        "number_of_runs": number_of_runs
    })
    return dag_data

# ------------------ Tasks ------------------------------
load_config = PythonOperator(
    task_id='load_config',
    provide_context=True,
    python_callable=load_dag,
    dag=dag)


t1 = SubDagOperator(
    task_id=CHILD_DAG_NAME,
    subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config') }}'" ),
    default_args=default_args,
    dag=dag,
)

Sub DAG File

def sub_dag(parent_dag_name, child_dag_name, args, num_of_runs):
    dag_subdag = DAG(
        dag_id='%s.%s' % (parent_dag_name, child_dag_name),
        default_args=args,
        schedule_interval=None)

    variabe_names = {}

    for i in range(num_of_runs):
        # Each task needs a unique task_id within the DAG.
        variabe_names['task' + str(i + 1)] = DummyOperator(
            task_id='dummy_task_' + str(i + 1),
            dag=dag_subdag,
        )

    return dag_subdag
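
A side note on the parent DAG file above: load_dag calls json.dumps twice, so the value it returns is a JSON object whose number_of_runs field is itself a JSON-encoded string, not an integer. The sketch below uses plain Python with no Airflow imports to show what a consumer would have to do before calling range(). The helper names encode_like_load_dag and decode_number_of_runs, and the sample value 3, are assumptions made for illustration.

```python
import json


def encode_like_load_dag(conf):
    """Mirror the encoding done by load_dag in the parent DAG file."""
    number_of_runs = json.dumps(conf["number_of_runs"])
    return json.dumps({"number_of_runs": number_of_runs})


def decode_number_of_runs(dag_data):
    """Hypothetical helper: undo both json.dumps calls and return an int."""
    outer = json.loads(dag_data)                 # dict with a JSON-encoded field
    inner = json.loads(outer["number_of_runs"])  # the original conf value
    return int(inner)


payload = encode_like_load_dag({"number_of_runs": 3})
print(payload)                         # {"number_of_runs": "3"}
print(decode_number_of_runs(payload))  # 3
```

Returning the plain integer from load_dag would avoid the double decoding, but that is a design choice for the asker rather than something the thread settles.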

Option - 2

I have also tried passing number_of_runs as a global variable, which did not work.

Option - 3

We also tried writing this value to a data file, but the SubDAG throws a "file doesn't exist" error. This might be because we generate the file dynamically.

Can someone help me with this?


Answer 1:


I've done it with Option 3. The key is to return a valid DAG with no tasks if the file does not exist. load_config then generates a file containing your number of tasks, or more information if needed. Your subdag factory would look something like:

def subdag(...):
    sdag = DAG('%s.%s' % (parent, child), default_args=args, schedule_interval=timedelta(hours=1))
    file_path = "/path/to/generated/file"
    # Until load_config has written the file, return a valid DAG with no tasks.
    if os.path.exists(file_path):
        with open(file_path) as data_file:
            list_tasks = data_file.readlines()
        for task in list_tasks:
            DummyOperator(
                # Strip the trailing newline so the task_id is valid.
                task_id='task_' + task.strip(),
                default_args=args,
                dag=sdag,
            )
    return sdag

When the DAG is first generated, you will see a subdag with no tasks. At DAG execution, after load_config has finished, you will see your dynamically generated subdag.
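
The answer does not show the writing side, so here is a minimal sketch of what load_config's callable might do, assuming one task suffix per line as the factory above expects. The names write_task_list and read_task_ids and the file name subdag_tasks.txt are hypothetical. The sketch uses only the standard library so it can be run without Airflow.

```python
import os
import tempfile


def write_task_list(number_of_runs, file_path):
    """Write one task suffix per line, replacing any previous file atomically."""
    directory = os.path.dirname(file_path) or "."
    # Write to a temp file first so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        for i in range(1, number_of_runs + 1):
            tmp.write("%d\n" % i)
    os.replace(tmp_path, file_path)


def read_task_ids(file_path):
    """Return the task ids a factory would create, or [] if no file exists yet."""
    if not os.path.exists(file_path):
        return []
    with open(file_path) as data_file:
        return ["task_" + line.strip() for line in data_file if line.strip()]


demo_path = os.path.join(tempfile.mkdtemp(), "subdag_tasks.txt")
print(read_task_ids(demo_path))  # [] before the file is written
write_task_list(3, demo_path)
print(read_task_ids(demo_path))  # ['task_1', 'task_2', 'task_3']
```

Writing to a temporary file and renaming it is one way to keep the scheduler from parsing a partially written list; whether that matters depends on how often the DAG file is re-parsed in your setup.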




Answer 2:


Option 1 should work if you just change the call to xcom_pull to include the dag_id of the parent DAG. By default, the xcom_pull call looks for the task_id 'load_config' in its own DAG, where it doesn't exist.

So change the xcom_pull call in the macro to:

subdag=sub_dag(PARENT_DAG_NAME, CHILD_DAG_NAME, default_args, "'{{ ti.xcom_pull(task_ids='load_config', dag_id='" + PARENT_DAG_NAME + "') }}'" ),
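
To make the quoting easier to check, this small sketch builds the same string on its own and prints it. PARENT_DAG_NAME is given a hypothetical value here, since the question does not show it.

```python
PARENT_DAG_NAME = "parent_dag"  # hypothetical value; not shown in the question

# Same concatenation as the fourth argument to sub_dag in the line above.
xcom_template = (
    "'{{ ti.xcom_pull(task_ids='load_config', dag_id='"
    + PARENT_DAG_NAME
    + "') }}'"
)
print(xcom_template)
# '{{ ti.xcom_pull(task_ids='load_config', dag_id='parent_dag') }}'
```

This only builds the string. Jinja renders it later, and only where Airflow treats the value as a templated field.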



Answer 3:


If the filename you are writing to is not dynamic (e.g. you are overwriting the same file for each task instance), Jaime's answer will work:

file_path = "/path/to/generated/file"

But if you need a unique filename, or want each task instance to write different content to the file for tasks executed in parallel, Airflow will not work for this case, since there is no way to pass the execution date or a variable outside of a template. Take a look at this post.




Answer 4:


Take a look at my answer here, in which I describe a way to create a task dynamically based on the results of a previously executed task using xcoms and subdags.



Source: https://stackoverflow.com/questions/44365716/airflow-passing-a-dynamic-value-to-sub-dag-operator
