How to stop a DAG from backfilling? catchup_by_default=False and catchup=False do not seem to stop the Airflow scheduler from backfilling

Posted by 生来就可爱ヽ(ⅴ<●) on 2020-05-23 17:49:13

Question


The setting catchup_by_default=False in airflow.cfg does not seem to work. Adding catchup=False to the DAG doesn't work either.

Here's how to reproduce the issue. I always start from a clean slate by running airflow resetdb. As soon as I unpause the dag, the tasks start to backfill.

Here's the setup for the dag. I'm just using the tutorial example.

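from datetime import datetime, timedelta

from airflow import DAG

# Default arguments applied to every task in the DAG.
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(2018, 9, 16),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Daily schedule; catchup=False is meant to prevent backfilling from start_date.
dag = DAG("tutorial", default_args=default_args, schedule_interval=timedelta(1), catchup=False)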

Answer 1:


To be clear, if you enabled the DAG you specified at 2018-10-22T9:00:00.000EDT (that is, 2018-10-22T13:00:00.000Z), it would be started some time after 2018-10-22T13:00:00.000Z with a run date marked 2018-10-21T00:00:00.000Z.

This is not backfilling from the start date. However, when there is no prior run, the scheduler does "catch up" the most recent completed valid period. I'm not sure why Airflow has behaved this way for a while, but it is documented that catchup=False means creating a single run for the most recent valid period.

If the DAG run's run date is still confusing, recall that the run date is the execution_date, which is the start of the interval period. The data for the interval is only completely available at the end of the interval period, but Airflow is designed to pass in the start of the period.

Then the next run would start sometime after 2018-10-23T00:00:00.000Z with an execution_date set as 2018-10-22T00:00:00.000Z.
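The date arithmetic above can be sketched in plain Python. This is only a simplified model of the behaviour described in this answer, not Airflow's scheduler code; the function name latest_completed_interval_start is hypothetical, and naive datetimes are assumed to be UTC.

```python
from datetime import datetime, timedelta


def latest_completed_interval_start(start_date, interval, now):
    """Return the start of the most recent interval that has fully elapsed by `now`.

    Returns None if no interval has completed yet.
    """
    # Number of whole intervals that have elapsed since start_date.
    completed = (now - start_date) // interval
    if completed < 1:
        return None
    # The last completed interval starts one interval before the latest boundary.
    return start_date + (completed - 1) * interval


start_date = datetime(2018, 9, 16)
interval = timedelta(days=1)

# 2018-10-22T13:00:00Z, the "now" used in this answer.
now = datetime(2018, 10, 22, 13, 0)
first_run = latest_completed_interval_start(start_date, interval, now)
print(first_run)  # 2018-10-21 00:00:00
```

Calling the same function with a time shortly after 2018-10-23T00:00:00Z gives 2018-10-22T00:00:00, which matches the execution_date of the next run described above.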

If, on the 22nd or later, you're getting any run date earlier than the 21st, or multiple runs scheduled, then yes, catchup=False is not working. But there are no other reports of that being the case in v1.10 or the v1-10-stable branch.
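One way to apply that check is to list the execution dates of the runs you actually see and flag any that are older than the expected one. The sketch below is a hypothetical helper, not part of Airflow, and the observed dates are made-up sample values standing in for whatever your DAG runs list shows.

```python
from datetime import datetime


def runs_older_than_expected(execution_dates, earliest_expected):
    """Return the execution dates that fall before the earliest expected one."""
    return sorted(d for d in execution_dates if d < earliest_expected)


# Hypothetical execution dates, e.g. copied by hand from the DAG runs list.
observed = [
    datetime(2018, 10, 19),
    datetime(2018, 10, 20),
    datetime(2018, 10, 21),
]

# With catchup=False and the DAG enabled on the 22nd, the 21st is the earliest expected run.
backfilled = runs_older_than_expected(observed, datetime(2018, 10, 21))
print(backfilled)
```

An empty result is consistent with catchup=False working as documented; any entries in the result would indicate unexpected backfilling.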




Answer 2:


As @dlamblin mentioned, and as the docs also state, Airflow creates a single DagRun for the most recent valid interval. catchup=False instructs the scheduler to only create a DAG Run for the most current instance of the DAG interval series.

However, there was a bug when using a timedelta for schedule_interval instead of a cron expression or cron preset. This has been fixed in Airflow master with https://github.com/apache/airflow/pull/8776. We will release Airflow 1.10.11 with this fix.



Source: https://stackoverflow.com/questions/52177418/how-to-stop-dag-from-backfilling-catchup-by-default-false-and-catchup-false-doe
