Airflow does not backfill latest run


Question


For some reason, Airflow doesn't seem to trigger the latest run for a DAG with a weekly schedule interval.

Current Date:

$ date
Tue Aug  9 17:09:55 UTC 2016

DAG:

from datetime import datetime
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='superdag',
    start_date=datetime(2016, 7, 18),
    schedule_interval=timedelta(days=7),
    default_args={
        'owner': 'Jon Doe',
        'depends_on_past': False
    }
)

BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag
)

Run scheduler

$ airflow scheduler -d superdag

You'd expect a total of four DAG Runs as the scheduler should backfill for 7/18, 7/25, 8/1, and 8/8. However, the last run is not scheduled.

EDIT 1:

I understand Vineet's point, although it doesn't seem to explain my issue.

In my example above, the DAG’s start date is July 18.

  • First DAG Run: July 18
  • Second DAG Run: July 25
  • Third DAG Run: Aug 1
  • Fourth DAG Run: Aug 8 (not run)

Each DAG Run processes data from the previous week.

Today being Aug 9, I would expect the fourth DAG Run to have executed with an execution date of Aug 8, processing data for the last week (Aug 1 until Aug 8), but it doesn't.


Answer 1:


Airflow always schedules for the previous period. So if you have a DAG that is scheduled to run daily, then on Aug 9th it will schedule a run with execution_date Aug 8th. Similarly, if the schedule interval is weekly, then on Aug 9th it will schedule for one week back, i.e. Aug 2nd, though the run itself happens on Aug 9th. This is just Airflow bookkeeping. You can find this in the Airflow wiki (https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls):

Understanding the execution date: Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available. This date is available to you in both Jinja and a Python callable's context in many forms, as documented here. As a note, ds refers to date_string, not date start, as may be confusing to some.
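To make that bookkeeping concrete for the DAG in the question, here is a minimal plain-Python sketch (not Airflow code; the dates are hard-coded from the question) of when each run actually fires: a run with execution_date E is only triggered once the interval it covers has passed, i.e. at E + schedule_interval.

from datetime import datetime, timedelta

start_date = datetime(2016, 7, 18)   # DAG start_date from the question
interval = timedelta(days=7)         # weekly schedule_interval
now = datetime(2016, 8, 9)           # "today" in the question

execution_date = start_date
while execution_date + interval <= now:
    # The run covering [execution_date, execution_date + interval)
    # fires only once that whole period has elapsed.
    print(execution_date.date(), "-> triggered on", (execution_date + interval).date())
    execution_date += interval

# Prints only 2016-07-18, 2016-07-25 and 2016-08-01; the run with
# execution_date 2016-08-08 will not be triggered until 2016-08-15.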

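The quoted excerpt mentions that the execution date is exposed to Jinja templates and Python callables. As a minimal sketch (reusing the dag object from the question, with a made-up task_id for illustration), the standard ds macro provides the execution date as a YYYY-MM-DD string inside templated fields:

from airflow.operators.bash_operator import BashOperator

# `ds` is rendered by Jinja at run time to the run's execution date,
# so each weekly run sees the date of the period it covers.
BashOperator(
    task_id='print_execution_date',
    bash_command='echo "processing data for {{ ds }}"',
    dag=dag,
)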



Answer 2:


A similar issue happened to me as well. I solved it by manually running airflow backfill -s start_date -e end_date DAG_NAME, where start_date and end_date cover the missing execution_date, in your case 2016-08-08. For example: airflow backfill -s 2016-08-07 -e 2016-08-09 DAG_NAME
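As a quick sanity check (plain Python, dates taken from the question), the missing execution_date 2016-08-08 is exactly three weekly intervals after the DAG's start_date, so it lies on the schedule grid and is covered by the suggested backfill window:

from datetime import datetime, timedelta

start_date = datetime(2016, 7, 18)   # DAG start_date
interval = timedelta(days=7)         # weekly schedule_interval
missing = datetime(2016, 8, 8)       # execution_date that never ran

# 2016-08-08 = 2016-07-18 + 3 * 7 days, and it falls inside the
# backfill range 2016-08-07 .. 2016-08-09 suggested above.
assert missing == start_date + 3 * interval
assert datetime(2016, 8, 7) <= missing <= datetime(2016, 8, 9)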



Source: https://stackoverflow.com/questions/38856886/airflow-does-not-backfill-latest-run
