apache-airflow

Running an Airflow DAG every X minutes

与世无争的帅哥 submitted on 2020-01-02 03:31:06
Question: I am using Airflow on an EC2 instance with the LocalScheduler option. I've invoked airflow scheduler and airflow webserver and everything seems to be running fine. That said, after supplying the cron string '*/10 * * * *' to schedule_interval for "do this every 10 minutes," the job continues to execute every 24 hours by default. Here's the header of the code:

    from datetime import datetime
    import os
    import sys
    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator
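The entry cuts off before any answer, but here is a minimal sketch of a DAG that actually fires every 10 minutes; the dag_id, start_date, and callable are illustrative, and one common cause of the 24-hour fallback is the cron string never reaching the DAG constructor (for example, being set on an operator instead of the DAG):

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG(
        dag_id='every_ten_minutes',        # illustrative name
        schedule_interval='*/10 * * * *',  # must be passed to the DAG itself
        start_date=datetime(2019, 12, 1),  # a fixed date in the past
        catchup=False,                     # don't backfill missed intervals
    )

    def print_time():
        print(datetime.now().isoformat())

    tick = PythonOperator(
        task_id='print_time',
        python_callable=print_time,
        dag=dag,
    )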

how do I use the --conf option in airflow

混江龙づ霸主 submitted on 2020-01-01 09:26:06
Question: I am trying to run an Airflow DAG and need to pass some parameters to the tasks. How do I read, in the Python DAG file, the JSON string passed as the --conf parameter of the trigger_dag CLI command? For example:

    airflow trigger_dag 'dag_name' -r 'run_id' --conf '{"key":"value"}'

Answer 1: Two ways. From inside a template field or file:

    {{ dag_run.conf['key'] }}

Or, when the context is available, e.g. within a Python callable of the PythonOperator:

    context['dag_run'].conf['key']
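A hedged end-to-end sketch combining both access patterns (the dag id, task ids, and conf key are illustrative; provide_context is the Airflow 1.x way to receive the context in a callable):

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('dag_name', schedule_interval=None, start_date=datetime(2019, 1, 1))

    # Template field: Jinja resolves dag_run.conf when the task runs.
    templated = BashOperator(
        task_id='echo_conf',
        bash_command='echo {{ dag_run.conf["key"] }}',
        dag=dag,
    )

    # Python callable: the same value via the task context.
    def read_conf(**context):
        print(context['dag_run'].conf['key'])

    from_context = PythonOperator(
        task_id='read_conf',
        python_callable=read_conf,
        provide_context=True,
        dag=dag,
    )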

Airflow: pass {{ ds }} as param to PostgresOperator

青春壹個敷衍的年華 submitted on 2019-12-31 20:33:08
Question: I would like to use the execution date as a parameter in my SQL file. I tried:

    dt = '{{ ds }}'
    s3_to_redshift = PostgresOperator(
        task_id='s3_to_redshift',
        postgres_conn_id='redshift',
        sql='s3_to_redshift.sql',
        params={'file': dt},
        dag=dag
    )

but it doesn't work.

Answer 1: dt = '{{ ds }}' doesn't work because Jinja (the templating engine used within Airflow) does not process the entire DAG definition file. For each operator, Jinja processes only certain fields, which are listed in the operator's template_fields attribute.
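On PostgresOperator, sql is one of those templated fields (and files ending in .sql are rendered by Jinja too), so {{ ds }} can be written directly into the statement or the .sql file instead of going through params. A hedged sketch with an embedded statement (the dag id, table, and S3 path are hypothetical):

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.postgres_operator import PostgresOperator

    dag = DAG('redshift_load', schedule_interval='@daily', start_date=datetime(2019, 1, 1))

    s3_to_redshift = PostgresOperator(
        task_id='s3_to_redshift',
        postgres_conn_id='redshift',
        sql="""
            COPY my_table                              -- hypothetical table
            FROM 's3://my-bucket/{{ ds }}/data.csv'    -- ds rendered at runtime
            CSV;
        """,
        dag=dag,
    )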

airflow pass parameter from cli

两盒软妹~` submitted on 2019-12-31 12:43:07
Question: Is there a way to pass a parameter to airflow trigger_dag dag_name {param}? I have a script that monitors a directory for files; when a file gets moved into the target directory, I want to trigger the DAG, passing the file path as a parameter.

Answer 1: You can pass it like this (note the --conf value must be quoted JSON):

    airflow trigger_dag --conf '{"file_variable": "/path/to/file"}' dag_id

Then in your DAG, you can access this variable using templating as follows:

    {{ dag_run.conf.file_variable }}

If this doesn't work, sharing a simple reproducible example would help.
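A hedged sketch of the watcher side: a small script that fires one DAG run per new file by shelling out to the CLI (the dag id, path, and helper name are illustrative):

    import json
    import subprocess

    def trigger_for_file(file_path):
        """Trigger one DAG run carrying the file path in its conf."""
        conf = json.dumps({'file_variable': file_path})
        subprocess.run(
            ['airflow', 'trigger_dag', '--conf', conf, 'file_processor'],
            check=True,
        )

    trigger_for_file('/data/incoming/report.csv')  # hypothetical path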

How to implement polling in Airflow?

两盒软妹~` submitted on 2019-12-24 18:41:29
Question: I want to use Airflow to implement data flows that periodically poll external systems (FTP servers, etc.), check for new files matching certain conditions, and then run a bunch of tasks for those files. Now, I'm a newbie to Airflow and have read that Sensors are what you would use for this kind of case, and I actually managed to write a sensor that works fine when I run it with "airflow test". But I'm a bit confused about the relation between the sensor's poke_interval and the DAG's scheduling.
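The two settings operate at different levels: schedule_interval decides when a new DAG run starts, while poke_interval decides how often the sensor re-checks within a single run (until its timeout). A hedged sketch, assuming Airflow 1.x import paths and an illustrative file check:

    import glob
    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator
    from airflow.operators.sensors import BaseSensorOperator

    class NewFileSensor(BaseSensorOperator):
        """Succeeds once a matching file appears (the pattern is hypothetical)."""

        def poke(self, context):
            return bool(glob.glob('/data/incoming/*.csv'))

    dag = DAG('poll_files', schedule_interval='*/15 * * * *',  # a new run every 15 minutes
              start_date=datetime(2019, 1, 1), catchup=False)

    wait_for_files = NewFileSensor(
        task_id='wait_for_files',
        poke_interval=60,   # seconds between checks within one run
        timeout=10 * 60,    # fail the task after 10 minutes of poking
        dag=dag,
    )

    process = PythonOperator(task_id='process', python_callable=lambda: None, dag=dag)
    wait_for_files >> process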

airflow.exceptions.AirflowException: Cycle detected in DAG. Faulty task

别等时光非礼了梦想. submitted on 2019-12-24 07:46:52
Question: I am running the Airflow pipeline and the code looks fine, but I'm getting airflow.exceptions.AirflowException: Cycle detected in DAG. Faulty task: ... Can you please help resolve this issue?

Answer 1: This can happen due to duplicate task_ids across multiple tasks.

Answer 2: Without the code, it's hard to help you. However, this error means that you have a loop in your DAG. Generally, it happens when one of your tasks has a downstream task whose own downstream chain includes it again.
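A minimal reproduction sketch (dag and task ids are illustrative): the last dependency line closes a loop, so loading the file raises the exception above.

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG('cycle_demo', schedule_interval=None, start_date=datetime(2019, 1, 1))

    t1 = DummyOperator(task_id='t1', dag=dag)
    t2 = DummyOperator(task_id='t2', dag=dag)
    t3 = DummyOperator(task_id='t3', dag=dag)

    # t1 -> t2 -> t3 -> t1: t1 appears in its own downstream chain,
    # so Airflow raises "Cycle detected in DAG" when parsing the file.
    t1 >> t2 >> t3 >> t1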

Manual DAG run set individual task state

妖精的绣舞 submitted on 2019-12-24 00:59:56
Question: I have a DAG without a schedule (it is run manually as needed). It has many tasks. Sometimes I want to 'skip' some initial tasks by manually changing their state to SUCCESS. Changing the task state of a manually executed DAG fails, seemingly because of a bug in parsing the execution_date. Is there another way to individually set task states for a manually executed DAG? Example run below. The execution date of the task is 01-13T17:27:13.130427, and I believe the milliseconds are not being parsed correctly.
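A hedged workaround sketch, not from the original thread: set the state directly through Airflow's ORM, sidestepping the UI's date handling. The dag/task ids and timestamp are illustrative, and the microseconds must match the run's execution_date exactly.

    from datetime import datetime

    from airflow import settings
    from airflow.models import TaskInstance
    from airflow.utils.state import State

    session = settings.Session()
    ti = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == 'my_manual_dag',
            TaskInstance.task_id == 'skip_me',
            TaskInstance.execution_date == datetime(2018, 1, 13, 17, 27, 13, 130427),
        )
        .one()
    )
    ti.state = State.SUCCESS
    session.commit()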

Airflow - Get start time of dag run

烂漫一生 submitted on 2019-12-24 00:46:07
Question: Is it possible to get the actual start time of a DAG in Airflow? By start time I mean the exact time the first task of a DAG run starts running. I know I can use macros to get the execution date. If the job is run using trigger_dag, that is what I would call a start time, but if the job runs on a daily schedule then {{ execution_date }} returns yesterday's date. I have also tried placing datetime.now().isoformat() in the body of the DAG code and then passing it to a task, but this seems to return the time the DAG file was parsed rather than when the run started.
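The DagRun object carries the run's own start timestamp, which is close to (though not exactly) when the first task begins. A hedged sketch, assuming Airflow 1.x-style provide_context (dag and task names are illustrative):

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('start_time_demo', schedule_interval='@daily', start_date=datetime(2019, 1, 1))

    def report_start(**context):
        dag_run = context['dag_run']
        print('run started at:', dag_run.start_date)        # wall-clock start of the run
        print('logical date:', context['execution_date'])   # schedule-period date

    report = PythonOperator(
        task_id='report_start',
        python_callable=report_start,
        provide_context=True,
        dag=dag,
    )

    # Template form: "{{ dag_run.start_date }}"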

Use XCom to exchange data between classes?

邮差的信 submitted on 2019-12-23 15:30:08
Question: I have the following DAG, which executes different methods of a class dedicated to a data preprocessing routine:

    from datetime import datetime
    import os
    import sys

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator

    import ds_dependencies

    SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH')
    if SCRIPT_PATH:
        sys.path.insert(0, SCRIPT_PATH)
        from table_builder import OnlineOfflinePreprocess
    else:
        print('Define MARKETING_PREPROC_PATH value in
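The question is cut off above, but the usual answer to the title is that operators run in separate processes, so state stored on a shared class instance in one task is not visible to the next; small values should instead be exchanged through XCom. A hedged sketch (the dag id, task ids, and table name are illustrative):

    from datetime import datetime

    from airflow.models import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('preproc_xcom_demo', schedule_interval=None, start_date=datetime(2019, 1, 1))

    def build_table(**context):
        table_name = 'marketing_stage'  # hypothetical value produced here
        return table_name               # return value is pushed to XCom automatically

    def load_table(**context):
        # Pull whatever the upstream callable returned.
        table_name = context['ti'].xcom_pull(task_ids='build_table')
        print('loading', table_name)

    build = PythonOperator(task_id='build_table', python_callable=build_table,
                           provide_context=True, dag=dag)
    load = PythonOperator(task_id='load_table', python_callable=load_table,
                          provide_context=True, dag=dag)
    build >> load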

Get all Airflow Leaf Nodes/Tasks

我怕爱的太早我们不能终老 submitted on 2019-12-23 10:13:07
Question: I want to build something where I need to capture all of the leaf tasks and add a downstream dependency to them to mark a job complete in our database. Is there an easy way to find all the leaf nodes of a DAG in Airflow?

Answer 1: Use the upstream_task_ids and downstream_task_ids properties of BaseOperator:

    def get_start_tasks(dag: DAG) -> List[BaseOperator]:
        # returns list of "head" / "root" tasks of the DAG
        return [task for task in dag.tasks if not task.upstream_task_ids]

    def get_end_tasks(dag: DAG) -> List[BaseOperator]:
        # returns list of "leaf" tasks of the DAG
        return [task for task in dag.tasks if not task.downstream_task_ids]
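A hedged usage sketch for the question's goal of appending a completion task after every leaf; it assumes the helpers above, an already-built dag with its tasks, and an illustrative task name and callable:

    # Capture the leaves before adding the new task, since the completion
    # task would otherwise be a leaf itself.
    leaves = get_end_tasks(dag)

    mark_complete = PythonOperator(
        task_id='mark_job_complete',
        python_callable=lambda: print('job complete'),  # stand-in for the DB write
        dag=dag,
    )

    for leaf in leaves:
        leaf >> mark_complete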