Question
I am a data engineer and work with airflow regularly.
When redeploying DAGs with a new start date, the best practice is as described here:
Don’t change start_date + interval: When a DAG has been run, the scheduler database contains instances of the run of that DAG. If you change the start_date or the interval and redeploy it, the scheduler may get confused because the intervals are different or the start_date is way back. The best way to deal with this is to change the version of the DAG as soon as you change the start_date or interval, i.e. my_dag_v1 and my_dag_v2. This way, historical information is also kept about the old version.
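As an illustration of that versioning advice, here is a minimal sketch of a redeployed DAG, assuming an Airflow 1.10-era API; the dag_id, dates, and interval are hypothetical:

```python
# Minimal sketch: bump the dag_id instead of editing the old DAG in place, so the
# scheduler treats it as a brand-new DAG and the my_dag_v1 history stays untouched.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Previous version (its runs stay recorded under "my_dag_v1"):
# DAG("my_dag_v1", start_date=datetime(2019, 1, 1), schedule_interval=timedelta(days=1))

with DAG(
    dag_id="my_dag_v2",                   # new version name
    start_date=datetime(2019, 7, 1),      # the new start date
    schedule_interval=timedelta(days=1),
    catchup=False,                        # don't backfill from the new start_date
) as dag:
    DummyOperator(task_id="placeholder")
```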
However, after deleting all previous DAG and task runs, I tried to redeploy a DAG with a new start date. It worked as expected (with the new start date) for a day, then started to run with the old one again.
What are the reasons for this? In depth if you can.
Answer 1:
Airflow maintains all of the information regarding past runs in a table called dag_run.
When you clear the previous dag runs, these entries are dropped from the database. Hence, airflow treats this dag as a new dag and starts at the specified time.
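For illustration only, one way to see what "dropping those entries" means is to delete the dag_run rows for a single DAG through Airflow's ORM models; the session helper and dag_id below are assumptions (helpers moved between Airflow versions), not the answer's exact method:

```python
# Illustrative sketch: delete the dag_run rows for one DAG so the scheduler no longer
# has a "last run" to anchor the next schedule on. Assumes Airflow 1.10's
# airflow.utils.db.create_session helper; in Airflow 2.x it lives in airflow.utils.session.
from airflow.models import DagRun
from airflow.utils.db import create_session

with create_session() as session:
    deleted = (
        session.query(DagRun)
        .filter(DagRun.dag_id == "my_dag_v1")   # hypothetical dag_id
        .delete(synchronize_session=False)
    )
    print(f"Deleted {deleted} dag_run rows")
```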
Airflow checks the last DAG execution time (the start_date of the last run) and adds the timedelta object you specified in schedule_interval to determine the next run.
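In other words, once a previous run exists, the next run is derived from it rather than from the start_date in the DAG file; a simplified sketch of that arithmetic (ignoring catchup and timezone details):

```python
# Simplified illustration: the scheduler adds schedule_interval to the last recorded
# execution date, so a changed start_date is ignored while old dag_run rows exist.
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)

last_execution_date = datetime(2019, 7, 18)              # from the old dag_run rows
next_execution_date = last_execution_date + schedule_interval
print(next_execution_date)                               # 2019-07-19 00:00:00

# With no rows left in dag_run, the scheduler falls back to the (new) start_date.
```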
If you are still having difficulties even after clearing the DAG runs, here are a few things you can do:
- Rename the DAG as suggested.
- Clear all the DAG runs and keep the DAG paused. Create a DAG run manually, then turn the DAG on; it will run at the scheduled time afterwards.
- The best approach would be to use a crontab expression for schedule_interval (see the sketch after this list).
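A short sketch of the crontab suggestion, again assuming an Airflow 1.10-era API; the cron expression, dag_id, and dates are examples only:

```python
# Example of a cron-style schedule_interval: run daily at 06:00 UTC.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="my_dag_v2",
    start_date=datetime(2019, 7, 1),
    schedule_interval="0 6 * * *",     # crontab expression instead of a timedelta
    catchup=False,
) as dag:
    DummyOperator(task_id="placeholder")
```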
Source: https://stackoverflow.com/questions/57096386/why-does-airflow-changing-start-date-without-renaming-dag