airflow

Airflow: creating & passing list of tables via XCOM (without storing as a file on a drive) and setting up correct dependencies?

落花浮王杯 submitted on 2021-01-29 07:15:37
Question: Here's the expected flow and dependency setup that I want to achieve: START ===> Create table list (only once when the DAG is triggered) ===> Read & pass table names (via XCom) ===> Create individual tasks dynamically for each table in the list ===> Print table name ===> END. Here's the sample code flow:

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

# create table list:
def create_source_table_list(dsn, uid, pwd, exclude_table_list, **kwargs):
    try:
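A minimal sketch of the XCom hand-off, assuming Airflow 1.10-style imports; the table names, task ids, and callables below are placeholders, not taken from the question. Note that tasks cannot be created dynamically at run time from an XCom value in Airflow 1.x, so this sketch pulls the list in a single downstream task and loops over it there:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("table_list_via_xcom", start_date=datetime(2021, 1, 1), schedule_interval=None)

def create_source_table_list(**kwargs):
    # Stand-in for the real DB query; the return value is pushed to XCom automatically.
    return ["customers", "orders", "items"]

def print_table_names(**kwargs):
    # Pull the list back from XCom instead of reading it from a file on disk.
    tables = kwargs["ti"].xcom_pull(task_ids="create_table_list")
    for table in tables:
        print(table)

start = DummyOperator(task_id="start", dag=dag)
end = DummyOperator(task_id="end", dag=dag)

create_table_list = PythonOperator(
    task_id="create_table_list",
    python_callable=create_source_table_list,
    provide_context=True,
    dag=dag,
)

print_tables = PythonOperator(
    task_id="print_table_names",
    python_callable=print_table_names,
    provide_context=True,
    dag=dag,
)

start >> create_table_list >> print_tables >> end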

How to run airflow with CeleryExecutor on a custom docker image

依然范特西╮ submitted on 2021-01-29 06:32:13
Question: I am adding Airflow to a web application that manually adds a directory containing business logic to the PYTHONPATH env var, and also does additional system-level setup that I want to be consistent across all servers in my cluster. I've been successfully running Celery for this application with RMQ as the broker and Redis as the task results backend for a while, and have prior experience running Airflow with the LocalExecutor. Instead of using Puckel's image, I have an entry point for a base

How to ignore an unknown column when loading to BigQuery using Airflow?

↘锁芯ラ submitted on 2021-01-29 05:21:08
Question: I'm loading data from Google Storage to BigQuery using GoogleCloudStorageToBigQueryOperator. The JSON file may have more columns than what I defined in the schema. In that case I want the load job to continue and simply ignore the unrecognized columns. I tried to use the ignore_unknown_values argument but it didn't make any difference. My operator:

def dc():
    return [
        {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "storeId", "type": "INTEGER", "mode": "NULLABLE"},
        ...
    ]

gcs_to
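For reference, a hedged sketch of how ignore_unknown_values is usually passed to this operator; the bucket, object and table names are placeholders, and the schema comes from the question's dc() helper. The flag is forwarded to the BigQuery load job, which then drops JSON keys not present in the schema:

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="my-bucket",                                                  # placeholder
    source_objects=["stores/*.json"],                                    # placeholder
    destination_project_dataset_table="my_project.my_dataset.stores",   # placeholder
    schema_fields=dc(),
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_TRUNCATE",
    ignore_unknown_values=True,  # extra JSON keys not in the schema are ignored
    dag=dag,
)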

Airflow - Branching join operators

夙愿已清 submitted on 2021-01-28 22:15:24
Question: I am trying to join branching operators in Airflow. I did this:

op1 >> [op2, op3, op4]
op2 >> op5
op3 >> op6
op4 >> op7
[op5, op6, op7] >> op8

It gives a graph like this, with relations between op2, op3, op4 and op8. How do I get this: Answer 1: Can you be more clear? I tried using your chain function and I can do what you wanted. Answer 2: First of all: can you be more specific as to the exact code you use to set the relationships between the tasks? Second: you could try using the chain function. If you look at
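A hedged sketch of the chain() approach that Answer 2 points to (the import path shown is the Airflow 1.10 one; Airflow 2 exposes it as airflow.models.baseoperator.chain):

from airflow.utils.helpers import chain

# Adjacent lists of equal length are wired element-wise, so this reproduces
# op1 fanning out to op2/op3/op4, then op2>>op5, op3>>op6, op4>>op7,
# and finally op5/op6/op7 all joining into op8.
chain(op1, [op2, op3, op4], [op5, op6, op7], op8)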

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two PySpark modules associated with these tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for task_spark_datatransform I have: task_spark_datatransform = BashOperator(task
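One common cause is that the BashOperator runs plain `python`, which does not have pyspark on its path. A hedged sketch of the usual workaround, calling spark-submit instead; the script path and master URL are placeholders:

from airflow.operators.bash_operator import BashOperator

task_spark_datatransform = BashOperator(
    task_id="task_spark_datatransform",
    # spark-submit sets up the PySpark environment itself, so the module is found.
    bash_command="spark-submit --master local[*] /path/to/spark_datatransform.py",
    dag=dag,
)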

Why is Airflow crashing with INFO - “Task exited with return code -9”?

六眼飞鱼酱① submitted on 2021-01-28 21:09:44
Question: I have a big DAG running, but it stops with just that message, and I couldn't find the error in the Airflow docs. If it makes a difference: my Airflow is running in Rancher with a Helm chart. Answer 1: This is usually an out-of-memory condition, I think. Answer 2: I looked for ways to reduce memory usage, and the first attempt was to load a limited number of rows instead of all of them:

serverCursor = conn.cursor("serverCursor")
serverCursor.execute(f'''select *
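A hedged sketch of where Answer 2 appears to be going: a psycopg2-style named (server-side) cursor combined with fetchmany keeps only one batch of rows in memory at a time. The query, batch size and process() call are placeholders:

serverCursor = conn.cursor("serverCursor")      # named cursor: rows stay on the server
serverCursor.execute("SELECT * FROM my_table")  # placeholder query
while True:
    rows = serverCursor.fetchmany(size=10000)   # bounded batch instead of fetchall()
    if not rows:
        break
    process(rows)                               # placeholder for the real work
serverCursor.close()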

Can someone provide me with the schema to recreate the dag_run table in the Airflow DB?

亡梦爱人 submitted on 2021-01-28 12:37:11
Question: I have a Google Cloud Composer environment on GCP and I accidentally deleted the dag_run table, after which the Airflow scheduler kept crashing and the Airflow webserver would not come up. I was able to re-create the dag_run table in the Airflow DB, which stopped the crashing, but I think I did not get the schema right, because I get the error below when I manually trigger a DAG in the Airflow webserver: "Ooops." (followed by the Airflow error page's ASCII art, truncated here).
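Rather than hand-writing the DDL, one option (a sketch, not tested against Composer) is to let the installed Airflow version recreate the table from its own SQLAlchemy model, so the schema matches that release:

from airflow import settings
from airflow.models import DagRun

# Creates only the missing dag_run table in the configured metadata database.
DagRun.__table__.create(settings.engine, checkfirst=True)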

Airflow Cluster Policy is not getting invoked

廉价感情. submitted on 2021-01-28 11:15:53
Question: I am trying to set up and understand a custom cluster policy. I'm not sure what I am doing wrong; however, following this, it is not working. Airflow version: 1.10.10. Expected result: it should throw an exception if I try to run a DAG with default_owner. Actual result: no such exception. /root/airflow/config/airflow_local_settings.py:

class PolicyError(Exception):
    pass

def cluster_policy(task):
    print("task_instance_mutation_hook")
    raise PolicyError

def task_instance_mutation_hook(ti):
    print("task_instance_mutation_hook
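For what it's worth, in Airflow 1.10.x the cluster policy hook has to be a module-level function named policy(task); a function called cluster_policy is never looked up, and task_instance_mutation_hook was only added in a later 1.10 release than 1.10.10. A hedged sketch of airflow_local_settings.py under that assumption (the owner string checked is just the value mentioned in the question):

class PolicyError(Exception):
    pass

def policy(task):
    # Reject tasks that still carry the configured default owner.
    if task.owner == "default_owner":
        raise PolicyError("task %s has no explicit owner" % task.task_id)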

Guarantee that some operators will be executed on the same airflow worker

痞子三分冷 submitted on 2021-01-28 07:33:15
Question: I have a DAG which downloads a CSV file from cloud storage and then uploads the CSV file to a 3rd party via HTTPS. The Airflow cluster I am executing on uses the CeleryExecutor by default, so I'm worried that at some point, when I scale up the number of workers, these tasks may be executed on different workers, e.g. worker A does the download, worker B tries to upload but doesn't find the file (because it's on worker A). Is it possible to somehow guarantee that both the download and upload operators will be
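Two common approaches: collapse download and upload into a single task, or pin both tasks to a dedicated Celery queue that only one worker listens to (started with `airflow worker -q single_host` in 1.10). A hedged sketch of the queue approach; task ids, callables and the queue name are placeholders:

from airflow.operators.python_operator import PythonOperator

download = PythonOperator(
    task_id="download_csv",
    python_callable=download_from_cloud_storage,  # placeholder
    queue="single_host",                          # only one worker consumes this queue
    dag=dag,
)
upload = PythonOperator(
    task_id="upload_csv",
    python_callable=upload_via_https,             # placeholder
    queue="single_host",
    dag=dag,
)
download >> upload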