airflow

Airflow: creating & passing list of tables via XCOM (without storing as a file on a drive) and setting up correct dependencies?

落花浮王杯 submitted on 2021-01-29 07:15:37
Question: Here's the expected flow and dependency setup that I want to achieve: START ===> Create table list (only once when the DAG is triggered) ===> Read & pass table names (via XCom) ===> Create individual tasks dynamically for each table in the list ===> Print table name ===> END. Here's the sample code flow:

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

# create table list:
def create_source_table_list(dsn, uid, pwd, exclude_table_list, **kwargs):
    try:
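A minimal sketch of the XCom hand-off, assuming Airflow 1.10-style imports; the table names, task ids, and callables below are placeholders, not taken from the question. Note that tasks cannot be created dynamically at run time from an XCom value in Airflow 1.x, so this sketch pulls the list in a single downstream task and loops over it there:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

dag = DAG("table_list_via_xcom", start_date=datetime(2021, 1, 1), schedule_interval=None)

def create_source_table_list(**kwargs):
    # Stand-in for the real DB query; the return value is pushed to XCom automatically.
    return ["customers", "orders", "items"]

def print_table_names(**kwargs):
    # Pull the list back from XCom instead of reading it from a file on disk.
    tables = kwargs["ti"].xcom_pull(task_ids="create_table_list")
    for table in tables:
        print(table)

start = DummyOperator(task_id="start", dag=dag)
end = DummyOperator(task_id="end", dag=dag)

create_table_list = PythonOperator(
    task_id="create_table_list",
    python_callable=create_source_table_list,
    provide_context=True,
    dag=dag,
)

print_tables = PythonOperator(
    task_id="print_table_names",
    python_callable=print_table_names,
    provide_context=True,
    dag=dag,
)

start >> create_table_list >> print_tables >> end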

How to run airflow with CeleryExecutor on a custom docker image

依然范特西╮ submitted on 2021-01-29 06:32:13
Question: I am adding Airflow to a web application that manually adds a directory containing business logic to the PYTHONPATH env var, and also does additional system-level setup that I want to be consistent across all servers in my cluster. I've been successfully running Celery for this application with RMQ as the broker and Redis as the task results backend for a while, and have prior experience running Airflow with the LocalExecutor. Instead of using Puckel's image, I have an entry point for a base

How to ignore an unknown column when loading to BigQuery using Airflow?

↘锁芯ラ submitted on 2021-01-29 05:21:08
Question: I'm loading data from Google Storage to BigQuery using GoogleCloudStorageToBigQueryOperator. The JSON file may have more columns than what I defined in the schema. In that case I want the load job to continue and simply ignore the unrecognized columns. I tried to use the ignore_unknown_values argument but it didn't make any difference. My operator:

def dc():
    return [
        {"name": "id", "type": "INTEGER", "mode": "NULLABLE"},
        {"name": "storeId", "type": "INTEGER", "mode": "NULLABLE"},
        ...
    ]

gcs_to
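For reference, a hedged sketch of how ignore_unknown_values is usually passed to this operator; the bucket, object and table names are placeholders, and the schema comes from the question's dc() helper. The flag is forwarded to the BigQuery load job, which then drops JSON keys not present in the schema:

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

gcs_to_bq = GoogleCloudStorageToBigQueryOperator(
    task_id="gcs_to_bq",
    bucket="my-bucket",                                                  # placeholder
    source_objects=["stores/*.json"],                                    # placeholder
    destination_project_dataset_table="my_project.my_dataset.stores",   # placeholder
    schema_fields=dc(),
    source_format="NEWLINE_DELIMITED_JSON",
    write_disposition="WRITE_TRUNCATE",
    ignore_unknown_values=True,  # extra JSON keys not in the schema are ignored
    dag=dag,
)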

Airflow - Branching join operators

夙愿已清 submitted on 2021-01-28 22:15:24
Question: I am trying to join branching operators in Airflow. I did this:

op1 >> [op2, op3, op4]
op2 >> op5
op3 >> op6
op4 >> op7
[op5, op6, op7] >> op8

It gives a graph like this, with relations between op2, op3, op4 and op8. How do I get this: Answer 1: Can you be more clear? I tried using your chain function and I can do what you wanted. Answer 2: First of all: can you be more specific as to the exact code you use to set the relationships between the tasks? Second: you could try using the chain function. If you look at
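A hedged sketch of the chain() approach that Answer 2 points to (the import path shown is the Airflow 1.10 one; Airflow 2 exposes it as airflow.models.baseoperator.chain):

from airflow.utils.helpers import chain

# Adjacent lists of equal length are wired element-wise, so this reproduces
# op1 fanning out to op2/op3/op4, then op2>>op5, op3>>op6, op4>>op7,
# and finally op5/op6/op7 all joining into op8.
chain(op1, [op2, op3, op4], [op5, op6, op7], op8)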

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two PySpark modules associated with these tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for task_spark_datatransform I have: task_spark_datatransform = BashOperator(task
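One common cause is that the BashOperator runs plain `python`, which does not have pyspark on its path. A hedged sketch of the usual workaround, calling spark-submit instead; the script path and master URL are placeholders:

from airflow.operators.bash_operator import BashOperator

task_spark_datatransform = BashOperator(
    task_id="task_spark_datatransform",
    # spark-submit sets up the PySpark environment itself, so the module is found.
    bash_command="spark-submit --master local[*] /path/to/spark_datatransform.py",
    dag=dag,
)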

Why is Airflow crashing with INFO - “Task exited with return code -9”?

六眼飞鱼酱① submitted on 2021-01-28 21:09:44
Question: I have a big DAG running, but it stops with just that message, and I couldn't find the error in the Airflow docs. If it makes a difference: my Airflow is running in Rancher with a Helm chart. Answer 1: This is usually an out-of-memory condition, I think. Answer 2: I looked for ways to reduce memory usage, and the first attempt was to load a limited number of rows instead of all of them:

serverCursor = conn.cursor("serverCursor")
serverCursor.execute(f'''select *
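A hedged sketch of where Answer 2 appears to be going: a psycopg2-style named (server-side) cursor combined with fetchmany keeps only one batch of rows in memory at a time. The query, batch size and process() call are placeholders:

serverCursor = conn.cursor("serverCursor")      # named cursor: rows stay on the server
serverCursor.execute("SELECT * FROM my_table")  # placeholder query
while True:
    rows = serverCursor.fetchmany(size=10000)   # bounded batch instead of fetchall()
    if not rows:
        break
    process(rows)                               # placeholder for the real work
serverCursor.close()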

Can someone provide me with the schema to recreate the dag_run table in the Airflow DB?

亡梦爱人 submitted on 2021-01-28 12:37:11
Question: I have a Google Cloud Composer environment on GCP and I accidentally deleted the dag_run table, after which the Airflow scheduler kept crashing and the Airflow webserver would not come up. I was able to re-create the dag_run table in the Airflow DB, which stopped the crashing, but I think I did not get the schema right, because I get the error below when I manually trigger a DAG in the Airflow webserver: "Ooops." (followed by the Airflow error page's ASCII art, truncated here).
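Rather than hand-writing the DDL, one option (a sketch, not tested against Composer) is to let the installed Airflow version recreate the table from its own SQLAlchemy model, so the schema matches that release:

from airflow import settings
from airflow.models import DagRun

# Creates only the missing dag_run table in the configured metadata database.
DagRun.__table__.create(settings.engine, checkfirst=True)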

Airflow Cluster Policy is not getting invoked

廉价感情. submitted on 2021-01-28 11:15:53
Question: I am trying to set up and understand a custom cluster policy. I'm not sure what I am doing wrong; however, following this, it is not working. Airflow version: 1.10.10. Expected result: it should throw an exception if I try to run a DAG with default_owner. Actual result: no such exception. /root/airflow/config/airflow_local_settings.py:

class PolicyError(Exception):
    pass

def cluster_policy(task):
    print("task_instance_mutation_hook")
    raise PolicyError

def task_instance_mutation_hook(ti):
    print("task_instance_mutation_hook
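For what it's worth, in Airflow 1.10.x the cluster policy hook has to be a module-level function named policy(task); a function called cluster_policy is never looked up, and task_instance_mutation_hook was only added in a later 1.10 release than 1.10.10. A hedged sketch of airflow_local_settings.py under that assumption (the owner string checked is just the value mentioned in the question):

class PolicyError(Exception):
    pass

def policy(task):
    # Reject tasks that still carry the configured default owner.
    if task.owner == "default_owner":
        raise PolicyError("task %s has no explicit owner" % task.task_id)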

Guarantee that some operators will be executed on the same airflow worker

痞子三分冷 submitted on 2021-01-28 07:33:15
Question: I have a DAG which downloads a CSV file from cloud storage and then uploads the CSV file to a 3rd party via HTTPS. The Airflow cluster I am executing on uses the CeleryExecutor by default, so I'm worried that at some point, when I scale up the number of workers, these tasks may be executed on different workers, e.g. worker A does the download, worker B tries to upload but doesn't find the file (because it's on worker A). Is it possible to somehow guarantee that both the download and upload operators will be
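Two common approaches: collapse download and upload into a single task, or pin both tasks to a dedicated Celery queue that only one worker listens to (started with `airflow worker -q single_host` in 1.10). A hedged sketch of the queue approach; task ids, callables and the queue name are placeholders:

from airflow.operators.python_operator import PythonOperator

download = PythonOperator(
    task_id="download_csv",
    python_callable=download_from_cloud_storage,  # placeholder
    queue="single_host",                          # only one worker consumes this queue
    dag=dag,
)
upload = PythonOperator(
    task_id="upload_csv",
    python_callable=upload_via_https,             # placeholder
    queue="single_host",
    dag=dag,
)
download >> upload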