Want to create airflow tasks that are downstream of the current task

Submitted by Deadly on 2020-01-23 17:53:05

Question


I'm mostly brand new to airflow.

I have a two-step process:

  1. Get all files that match a criteria
  2. Uncompress the files

The files are half a gig compressed and 2-3 gig uncompressed. I can easily have 20+ files to process at a time, which means uncompressing all of them can run longer than just about any reasonable timeout.

I could use XCom to get the results of step 1, but what I'd like to do is something like this:

import os
from airflow.operators.python_operator import PythonOperator

def processFiles(reqDir, gvcfDir, matchSuffix):
    theFiles = getFiles(reqDir, gvcfDir, matchSuffix)

    for theFile in theFiles:
        task = PythonOperator(task_id="Uncompress_" + os.path.basename(theFile),
                              python_callable=expandFile,
                              op_kwargs={'theFile': theFile},
                              dag=dag)
        task.set_upstream(runThis)

The problem is that "runThis" is the PythonOperator that called processFiles, so it has to be declared after processFiles.

Is there any way to make this work?

Is this the reason that XCom exists, and should I dump this approach and go with XCom?


Answer 1:


Regarding your proposed solution: to the best of my knowledge, you can't use XComs to achieve this, because they are only available to running task instances, not when you define the DAG.
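
To make the distinction concrete, here is a minimal sketch of how XComs are normally consumed (Airflow 1.x style; list_files is a hypothetical upstream task that pushed the file list as its return value, and dag is a DAG object like the one in the example below):

from airflow.operators.python_operator import PythonOperator

def print_files(**context):
    # xcom_pull works here, at execution time, through the task instance;
    # at DAG-definition time there is no TaskInstance to pull from.
    files = context['ti'].xcom_pull(task_ids='list_files')
    print(files)

show_files = PythonOperator(
    task_id='show_files',
    python_callable=print_files,
    provide_context=True,  # Airflow 1.x: pass the execution context as kwargs
    dag=dag,
)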

You can however use a SubDAG to achieve your objective. The SubDagOperator wraps a DAG produced by a factory function; because Airflow re-parses the DAG definition file regularly, the factory runs again on each parse, giving you a chance to dynamically build a sub-section of your workflow.

You can test the idea using this simple example, which generates a random number of tasks every time it's invoked:

import airflow
from builtins import range
from random import randint
from airflow.operators.bash_operator import BashOperator
from airflow.operators.subdag_operator import SubDagOperator
from airflow.models import DAG

args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2)
}

dag = DAG(dag_id='dynamic_dag', default_args=args)

def generate_subdag(parent_dag, dag_id, default_args):
    # pseudo-randomly determine a number of tasks to be created
    n_tasks = randint(1, 10)

    subdag = DAG(
        '%s.%s' % (parent_dag.dag_id, dag_id),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date,
        default_args=default_args
    )
    for i in range(n_tasks):
        BashOperator(task_id='echo_%s' % i, bash_command='echo %s' % i, dag=subdag)

    return subdag

subdag_dag_id = 'dynamic_subdag'

SubDagOperator(
    subdag=generate_subdag(dag, subdag_dag_id, args),
    task_id=subdag_dag_id,
    dag=dag
)

If you execute this you'll notice that the SubDAG is likely to contain a different number of tasks on different runs (I tested this with version 1.8.0). You can reach the SubDAG view in the web UI by opening the graph view, clicking the grey SubDAG node, and then clicking "Zoom into SubDAG".

You can apply this concept to your case by listing the files and creating one task for each of them, instead of generating a random number of tasks as in the example; a sketch follows. The tasks themselves can be arranged in parallel (as I did), sequentially, or in any valid directed acyclic layout.
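
For instance, a factory along these lines would do it. This is only a sketch: getFiles and expandFile are the helpers from your question, the directory arguments are placeholders, and dag and args refer to the objects defined in the example above:

import os

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.subdag_operator import SubDagOperator

def generate_uncompress_subdag(parent_dag, dag_id, default_args,
                               reqDir, gvcfDir, matchSuffix):
    subdag = DAG(
        '%s.%s' % (parent_dag.dag_id, dag_id),
        schedule_interval=parent_dag.schedule_interval,
        start_date=parent_dag.start_date,
        default_args=default_args,
    )
    # getFiles runs at parse time, so the SubDAG is rebuilt with the
    # current set of matching files each time the DAG file is parsed.
    for theFile in getFiles(reqDir, gvcfDir, matchSuffix):
        PythonOperator(
            task_id='Uncompress_' + os.path.basename(theFile),
            python_callable=expandFile,
            op_kwargs={'theFile': theFile},
            dag=subdag,
        )
    return subdag

uncompress = SubDagOperator(
    subdag=generate_uncompress_subdag(dag, 'uncompress_files', args,
                                      '/req/dir', '/gvcf/dir', '.gz'),
    task_id='uncompress_files',
    dag=dag,
)

You could then wire it in with runThis.set_downstream(uncompress); note, though, that getFiles would now run at parse time rather than inside runThis, so listing the files has to be possible before the run starts.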



Source: https://stackoverflow.com/questions/48197709/want-to-create-airflow-tasks-that-are-downstream-of-the-current-task
