Why does Airflow revert to the old start_date after I change it without renaming the DAG?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-25 01:13:36

Question


I am a data engineer and work with Airflow regularly.

When redeploying DAGs with a new start date, the recommended best practice is as follows:

Don’t change start_date + interval: When a DAG has been run, the scheduler database contains instances of the run of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is way back. The best way to deal with this is to change the version of the DAG as soon as you change the start_date or interval, i.e. my_dag_v1 and my_dag_v2. This way, historical information is also kept about the old version.

However, after deleting all previous DAG and task runs, I tried to redeploy a DAG with a new start date. It worked as expected (with the new start date) for a day, then started to work with the old start date again.

What are the reasons for this? In depth if you can.


Answer 1:


Airflow keeps all of the information about past runs in the dag_run table.

When you clear the previous DAG runs, these entries are removed from the database. Airflow therefore treats the DAG as a new DAG and starts at the specified time.

To decide when the next run is due, Airflow looks up the last DAG run (the execution_date of the most recent entry in dag_run) and adds the interval you specified in schedule_interval.
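The behaviour described above can be sketched in plain Python. This is only a simplified model of what this answer describes, not Airflow's actual scheduler code; the helper name `next_run_date` and the example dates are made up for illustration.

```python
from datetime import datetime, timedelta


def next_run_date(last_run_date, start_date, interval):
    """Hypothetical helper: a simplified model of the rule described above."""
    if last_run_date is None:
        # No entry in dag_run: the DAG looks new, so it starts at start_date.
        return start_date
    # An entry exists: the next run is derived from the stored run,
    # so in this model a changed start_date has no effect.
    return last_run_date + interval


interval = timedelta(days=1)
new_start = datetime(2019, 7, 15)     # start_date after redeploying
old_last_run = datetime(2019, 7, 1)   # leftover run from the old deployment

fresh = next_run_date(None, new_start, interval)
stale = next_run_date(old_last_run, new_start, interval)

print(fresh)  # derived from the new start_date
print(stale)  # derived from the old run
```

If any old entry survives in dag_run (or is created again), the second case applies, which would match the DAG going back to its old schedule.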

If you still have difficulties after clearing the DAG runs, there are a few things you can do:

  1. Rename the DAG as suggested.
  2. Clear all the DAG runs and keep the DAG paused. Create a DAG run, then turn the DAG on. It will run at the scheduled time afterwards.
  3. The best approach would be to use a cron expression in schedule_interval.
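Options 1 and 3 can be combined in the DAG file. The following is a minimal sketch, not a definitive setup: `build_dag_kwargs`, the `my_dag` base name, the start date and the cron expression are all illustrative assumptions. It only builds the keyword arguments, so it runs without Airflow installed.

```python
from datetime import datetime

BASE_DAG_ID = "my_dag"  # illustrative name
DAG_VERSION = 2         # bump by hand whenever start_date or the schedule changes


def build_dag_kwargs(base_id, version, start_date, cron):
    """Hypothetical helper: collect DAG arguments with a versioned dag_id."""
    return {
        "dag_id": f"{base_id}_v{version}",
        "start_date": start_date,
        # Cron expression instead of a timedelta (option 3).
        "schedule_interval": cron,
    }


dag_kwargs = build_dag_kwargs(
    BASE_DAG_ID,
    DAG_VERSION,
    datetime(2019, 7, 15),  # example start date
    "0 6 * * *",            # example schedule: every day at 06:00
)

# In a real DAG file these would be passed to Airflow, e.g.:
#   from airflow import DAG
#   dag = DAG(**dag_kwargs)
print(dag_kwargs["dag_id"])
```

Because the dag_id changes with the version, the new DAG has no rows in dag_run, and the history of the old version is kept under its old name. Newer Airflow releases may use a different parameter name than schedule_interval, so check the version you deploy to.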


Source: https://stackoverflow.com/questions/57096386/why-does-airflow-changing-start-date-without-renaming-dag
