Schedule a DAG in Airflow to run every 5 minutes, starting from today, i.e., 2019-12-18

Submitted by 只谈情不闲聊 on 2020-06-29 06:00:54

Question


I am trying to run a DAG every 5 minutes, starting from today (2019-12-18). I defined my start date as start_date: dt.datetime(2019, 12, 18, 10, 00, 00) and the schedule interval as schedule_interval='*/5 * * * *'. When I start the Airflow scheduler, I don't see any of my tasks running.

But when I change the start date to start_date: dt.datetime(2019, 12, 17, 10, 00, 00), i.e., yesterday's date, the DAG runs continuously, roughly every 10 seconds instead of every 5 minutes.

I think the solution is to set start_date correctly, but I could not work out how. Please help me!

This is my code.

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
import datetime as dt
from airflow.operators.python_operator import PythonOperator

def print_world():
   print('world')


default_args = {
    'owner': 'bhanuprakash',
    'depends_on_past': False,
    'start_date': dt.datetime(2019, 12, 18, 10, 00, 00),
    'email': ['bhanuprakash.uchula@techwave.net'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5)
}

with DAG('dag_today',
    default_args=default_args,
    schedule_interval= '*/5 * * * *'
    ) as dag:


    print_hello = BashOperator(task_id='print_hello',
        bash_command='gnome-terminal')


    sleep = BashOperator(task_id='sleep',
        bash_command='sleep 5')


    print_world = PythonOperator(task_id='print_world',
        python_callable=print_world)

print_hello >> sleep >> print_world

Answer 1:


The datetime object you are passing to Airflow isn't timezone-aware. Airflow uses UTC internally, so a naive datetime may not line up with the scheduler's notion of time, and this could be why the DAG isn't being scheduled to run "today" (2019-12-18).

Instead of passing a naive datetime object like this:

'start_date': dt.datetime(2019, 12, 18, 10, 00, 00)

Try using pendulum to make your DAG timezone aware:

import pendulum

...
# See the list of tz database time zones here -> https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
'start_date': dt.datetime(2019, 12, 18, 10, 00, 00, tzinfo=pendulum.timezone('YOUR TIMEZONE')),

The docs (https://airflow.apache.org/docs/stable/timezone.html) are quite useful for getting tips on how to handle datetimes in Airflow.
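To make the mismatch concrete, here is a small sketch using only the standard library. It assumes the naive value is read as UTC, and it uses a hypothetical local offset of UTC+05:30 purely for illustration (the question does not state the asker's time zone).

```python
import datetime as dt

# Hypothetical local offset (UTC+05:30), chosen only for illustration.
LOCAL_TZ = dt.timezone(dt.timedelta(hours=5, minutes=30))

# The naive start date from the question.
naive_start = dt.datetime(2019, 12, 18, 10, 0, 0)

# Reading the naive value as UTC gives 10:00 UTC.
as_utc = naive_start.replace(tzinfo=dt.timezone.utc)

# Reading it as local wall-clock time gives 10:00 at UTC+05:30.
as_local = naive_start.replace(tzinfo=LOCAL_TZ)

# The two readings name different instants; the gap equals the offset.
gap = as_utc - as_local

print(as_local.astimezone(dt.timezone.utc).isoformat())
print(gap)
```

Under that assumed offset, the UTC reading is five and a half hours later than the intended local time, so a scheduler started at 10:00 local time would still see the start date as being in the future.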

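As for your other question on run frequency: by default, Airflow does "Catchup", meaning it creates a DAG run for every interval between your start_date and now (or your end_date). With a start date of yesterday and a 5-minute schedule, that leaves a backlog of runs, which the scheduler works through back to back instead of waiting 5 minutes between them. To disable this behavior you will need to add catchup=False when instantiating your DAG.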

From the Airflow docs

Backfill and Catchup

An Airflow DAG with a start_date, possibly an end_date, and a schedule_interval defines a series of intervals which the scheduler turn into individual Dag Runs and execute. A key capability of Airflow is that these DAG Runs are atomic, idempotent items, and the scheduler, by default, will examine the lifetime of the DAG (from start to end/now, one interval at a time) and kick off a DAG Run for any interval that has not been run (or has been cleared). This concept is called Catchup.

If your DAG is written to handle its own catchup (IE not limited to the interval, but instead to “Now” for instance.), then you will want to turn catchup off (Either on the DAG itself with dag.catchup = False) or by default at the configuration file level with catchup_by_default = False. What this will do, is to instruct the scheduler to only create a DAG Run for the most current instance of the DAG interval series.
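A rough way to see the size of the backlog is to model the interval series in plain Python. This is a simplified sketch, not Airflow's scheduler code: scheduled_runs is a hypothetical helper, and the value of now is an assumption (the question does not say when the scheduler was started). It assumes a run becomes eligible once its interval has fully elapsed.

```python
import datetime as dt


def scheduled_runs(start, now, interval, catchup=True):
    """Return the interval start times a simplified scheduler would run.

    A run covering [t, t + interval) is eligible once t + interval <= now.
    With catchup disabled, only the most recent eligible interval is kept.
    """
    runs = []
    t = start
    while t + interval <= now:
        runs.append(t)
        t += interval
    if not catchup:
        return runs[-1:]
    return runs


# Start date from the question's second attempt ("yesterday").
start = dt.datetime(2019, 12, 17, 10, 0)
# Assumed moment the scheduler is started, exactly one day later.
now = dt.datetime(2019, 12, 18, 10, 0)
every_5_min = dt.timedelta(minutes=5)

backlog = scheduled_runs(start, now, every_5_min, catchup=True)
latest_only = scheduled_runs(start, now, every_5_min, catchup=False)

print(len(backlog))   # number of runs queued up when catchup is on
print(latest_only)    # the single run kept when catchup is off
```

In this model, one day of 5-minute intervals yields 288 pending runs with catchup enabled, and a single run (the 09:55 interval) with it disabled, which matches the behavior the quoted docs describe for catchup=False.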

I'd suggest going over the two pages I linked to get a better intuition of the basic Airflow concepts.



Source: https://stackoverflow.com/questions/59391110/schedule-a-dag-in-airflow-to-run-for-every-5-minutes-starting-from-today-i-e
