airflow

Running Spark Submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35): how to connect to the remote cluster from Airflow

Submitted by 亡梦爱人 on 2021-01-29 20:34:26

Question: I have been trying to run spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) while Airflow runs on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster. When I try an SSH hook it fails (and I also have many doubts about using the SSHOperator and BashOperator): Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko Answer 1: You can try using Livy. In the following
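A minimal sketch of what the Livy route might look like, assuming a Livy server is reachable on the Spark cluster (1**.1*.0.21, port 8998) and registered as an Airflow HTTP connection named livy_default; the jar path and class name are placeholders, not taken from the original answer, and the import path is the Airflow 1.10 one:

import json
from datetime import datetime

from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator

dag = DAG('spark_via_livy', start_date=datetime(2021, 1, 1), schedule_interval=None)

# Submit the Spark job through Livy's REST /batches endpoint on the remote cluster,
# so no jars or XML config files have to be copied onto the Airflow machines.
submit_spark_job = SimpleHttpOperator(
    task_id='submit_spark_job_via_livy',
    http_conn_id='livy_default',  # assumed to point at http://1**.1*.0.21:8998
    endpoint='batches',
    method='POST',
    headers={'Content-Type': 'application/json'},
    data=json.dumps({
        'file': 'hdfs:///jobs/my_spark_job.jar',  # placeholder path on the Spark cluster
        'className': 'com.example.MySparkJob',    # placeholder main class
    }),
    dag=dag,
)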

Conditionally execute multiple branches one by one

Submitted by 痞子三分冷 on 2021-01-29 20:10:22

Question: Note: please read and understand the question thoroughly; it cannot be solved by a simple BranchPythonOperator / ShortCircuitOperator. We have an unusual multiplexer-like use case in our workflow, where a MUX-task fans out to the begin-task of each branch (branch-1.begin-task, branch-2.begin-task, and so on); the original ASCII diagram did not survive extraction.
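For illustration only, here is a sketch of the DAG shape the diagram appears to describe; the operators are placeholders, and how to run the selected branches one by one, conditionally, is exactly the open question:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('mux_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

# The MUX task fans out to the begin-task of each branch.
mux_task = DummyOperator(task_id='MUX-task', dag=dag)
branch_begins = [
    DummyOperator(task_id=f'branch-{i}.begin-task', dag=dag) for i in (1, 2, 3)
]
mux_task >> branch_begins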

Cloud Composer + Airflow: Setting up DAGs to trigger on HTTP (or should I use Cloud Functions?)

Submitted by 时间秒杀一切 on 2021-01-29 19:02:36

Question: Ultimately, what I want is a Python script that runs dynamically whenever an HTTP request comes in. It would work like this: App 1 runs and sends out a webhook, and the Python script catches the webhook immediately and does whatever it does. I saw that you can do this in GCP with Composer and Airflow, but I'm having several issues following these instructions: https://cloud.google.com/composer/docs/how-to/using/triggering-with-gcf. Running this in Cloud Shell to grant blob signing permissions:
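For orientation, a simplified sketch of the Cloud Functions side of that guide: an HTTP-triggered function that forwards the incoming webhook to the Composer environment's Airflow 1.10 experimental REST API. The webserver URL and DAG id are placeholders, and the IAP authentication that the guide sets up is omitted here:

import requests

def trigger_dag(request):
    # HTTP-triggered Cloud Function: create a DAG run in the Composer environment.
    webserver_url = 'https://<your-composer-webserver>.appspot.com'  # placeholder
    dag_id = 'my_http_triggered_dag'                                 # placeholder
    # Airflow 1.10 experimental REST API endpoint for creating a DAG run.
    resp = requests.post(
        f'{webserver_url}/api/experimental/dags/{dag_id}/dag_runs',
        json={'conf': request.get_json(silent=True) or {}},
    )
    return resp.text, resp.status_code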

Airflow can't pickle _thread._local objects

Submitted by 爱⌒轻易说出口 on 2021-01-29 18:48:25

Question: I am currently creating a SQLAlchemy engine in my DAG and passing it as a parameter to a PythonOperator to do some database work, e.g. PythonOperator(python_callable=my_callable, op_args=[engine], provide_context=True, task_id='my_task', dag=dag). When I try to clear the status of tasks, I get an error: File "/opt/conda/lib/python3.7/copy.py", line 169, in deepcopy rv = reductor(4) TypeError: can't pickle _thread._local objects This is most likely because you can't pickle engine
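A common way around this, sketched under the assumption that my_callable can be adjusted: pass only the database URI (a plain, picklable string) and create the engine inside the callable instead of at DAG-definition time. The URI and the query are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from sqlalchemy import create_engine, text

dag = DAG('engine_inside_task', start_date=datetime(2021, 1, 1), schedule_interval=None)

DB_URI = 'postgresql+psycopg2://user:password@host:5432/dbname'  # placeholder

def my_callable(db_uri, **context):
    # Build the engine inside the task, so the un-picklable object never
    # lives on the operator (and is never deep-copied along with it).
    engine = create_engine(db_uri)
    with engine.connect() as conn:
        conn.execute(text('SELECT 1'))  # placeholder for the real database work

my_task = PythonOperator(
    python_callable=my_callable,
    op_args=[DB_URI],
    provide_context=True,
    task_id='my_task',
    dag=dag,
)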

Optimizing an Airflow task that transfers data from BigQuery into MongoDB

Submitted by 风流意气都作罢 on 2021-01-29 16:32:28

Question: I need to improve the performance of an Airflow task that transfers data from BigQuery to MongoDB. The relevant task in my DAG uses a PythonOperator and simply calls the following Python function to transfer a single table/collection:

def transfer_full_table(table_name):
    start_time = time.time()

    # (1) Connect to BigQuery + Mongo DB
    bq = bigquery.Client()
    cluster = MongoClient(MONGO_URI)
    db = cluster["dbname"]
    print(f'(1) Connected to BQ + Mongo: {round(time.time() - start_time, 5)}')

    # (2)-
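The excerpt is cut off before the transfer loop, but one optimization that often helps for this kind of task is to stream rows out of BigQuery and write them to MongoDB with bulk inserts instead of one document at a time. A sketch, with the dataset name, connection string and batch size as placeholders:

from google.cloud import bigquery
from pymongo import MongoClient

MONGO_URI = 'mongodb://user:password@host:27017'  # placeholder, as in the question

def transfer_full_table_batched(table_name, batch_size=10000):
    bq = bigquery.Client()
    collection = MongoClient(MONGO_URI)["dbname"][table_name]

    # Stream the table page by page rather than materialising it all in memory.
    rows = bq.list_rows(f"my_dataset.{table_name}", page_size=batch_size)  # placeholder dataset

    batch = []
    for row in rows:
        batch.append(dict(row))
        if len(batch) >= batch_size:
            # Unordered bulk inserts are usually much faster than per-row inserts.
            collection.insert_many(batch, ordered=False)
            batch = []
    if batch:
        collection.insert_many(batch, ordered=False)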

Airflow: trigger a DAG any time a Google Sheet is updated

Submitted by 六月ゝ 毕业季﹏ on 2021-01-29 14:30:58

Question: Is there any way I can schedule a DAG to be triggered right after a Google Sheet is updated? I'm not sure I can get an answer from this doc: https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/suite/hooks/sheets/index.html Answer 1: @Alejandro's direction is right, but just to expand on his answer: you can use the HttpSensor operator to make a GET request against the sheet file via the Google Drive API: HttpSensor( task_id='http_sensor_check', http_conn_id='http_default', endpoint='https://www
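The snippet in the answer is truncated at the endpoint. A completed sketch of the same idea might look like the following, polling the Drive API v3 files endpoint for the sheet's modifiedTime; the file id, connection, reference timestamp and the API authorization (not shown) are all assumptions, and the import path is the Airflow 1.10 one:

from datetime import datetime

from airflow import DAG
from airflow.sensors.http_sensor import HttpSensor

dag = DAG('sheet_watcher', start_date=datetime(2021, 1, 1), schedule_interval='*/5 * * * *')

FILE_ID = '<google-sheet-file-id>'      # placeholder
LAST_SEEN = '2021-01-01T00:00:00.000Z'  # placeholder reference timestamp

wait_for_sheet_update = HttpSensor(
    task_id='http_sensor_check',
    http_conn_id='http_default',        # assumed base URL: https://www.googleapis.com
    endpoint=f'drive/v3/files/{FILE_ID}',
    request_params={'fields': 'modifiedTime'},
    # Keep poking until the sheet's RFC 3339 modifiedTime is later than the reference;
    # same-format timestamps compare correctly as strings.
    response_check=lambda response: response.json()['modifiedTime'] > LAST_SEEN,
    poke_interval=60,
    dag=dag,
)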

Using the output of one Python task as the input to another Python task in Airflow

Submitted by 你离开我真会死。 on 2021-01-29 13:21:33

Question: I'm creating a data flow with Apache Airflow to grab some data that's stored in a Pandas DataFrame and then store it in MongoDB. I have two Python methods, one for fetching the data and returning the DataFrame, and the other for storing it in the relevant database. How do I take the output of one task and feed it as the input to another task? This is what I have so far (a summarized and condensed version). I looked into the concept of XCom pull and push, and that's what I
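A minimal sketch of the XCom approach (task and function names are placeholders): the first task returns a value, which Airflow pushes to XCom, and the second task pulls it by task id. Note that XCom is meant for small payloads, so a large DataFrame is usually better handed off through intermediate storage; it is serialised to JSON here only for illustration:

from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG('xcom_handoff', start_date=datetime(2021, 1, 1), schedule_interval=None)

def fetch_data(**context):
    df = pd.DataFrame({'a': [1, 2, 3]})  # placeholder for the real fetch
    # The return value is pushed to XCom under the key 'return_value'.
    return df.to_json()

def store_data(**context):
    df = pd.read_json(context['ti'].xcom_pull(task_ids='fetch_data'))
    # ... insert df into MongoDB here ...
    print(len(df))

fetch_task = PythonOperator(task_id='fetch_data', python_callable=fetch_data,
                            provide_context=True, dag=dag)
store_task = PythonOperator(task_id='store_data', python_callable=store_data,
                            provide_context=True, dag=dag)
fetch_task >> store_task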

Airflow: set a task instance's status to skipped programmatically

Submitted by 眉间皱痕 on 2021-01-29 10:15:53

Question: I have a list that I loop over to create tasks. The list is static as far as size goes.

for counter, account_id in enumerate(ACCOUNT_LIST):
    task_id = f"bash_task_{counter}"
    if account_id:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
    else:
        trigger_task = BashOperator(
            task_id=task_id,
            bash_command="echo hello there",
            dag=dag)
        trigger_task.status = SKIPPED  # is there a way to somehow set the status of this to skipped instead of having a branch operator?

trigger
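One way to get a genuinely skipped task instance without a branch operator, sketched here rather than taken from the excerpt, is to let the task skip itself by raising AirflowSkipException from a PythonOperator:

from datetime import datetime

from airflow import DAG
from airflow.exceptions import AirflowSkipException
from airflow.operators.python_operator import PythonOperator

dag = DAG('skip_example', start_date=datetime(2021, 1, 1), schedule_interval=None)

ACCOUNT_LIST = ['acct_1', None, 'acct_3']  # placeholder for the question's static list

def run_or_skip(account_id, **context):
    if not account_id:
        # Raising this marks the task instance as 'skipped' rather than failed.
        raise AirflowSkipException(f"No account id, skipping {context['task'].task_id}")
    print("hello there")

for counter, account_id in enumerate(ACCOUNT_LIST):
    PythonOperator(
        task_id=f"python_task_{counter}",
        python_callable=run_or_skip,
        op_args=[account_id],
        provide_context=True,
        dag=dag,
    )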

Does Airflow require MySQL?

Submitted by 蓝咒 on 2021-01-29 08:02:24

Question: I am trying to upgrade our version of Airflow to 1.10.0. When I do, I get an error complaining that it cannot connect to MySQL:

worker_1 | sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2002, 'Can\'t connect to local MySQL server through socket \'/var/run/mysqld/mysqld.sock\' (2 "No such file or directory")') (Background on this error at: http://sqlalche.me/e/e3q8)

When I try to remove MySQL from our systems altogether, I get the following instead:

scheduler_1 | [2018-10
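For what it's worth, Airflow itself does not require MySQL: the scheduler and webserver talk to whatever metadata database sql_alchemy_conn points at (SQLite with the SequentialExecutor, or Postgres/MySQL with LocalExecutor/CeleryExecutor). As an illustration, the relevant airflow.cfg lines for a Postgres backend on Airflow 1.10 could look like this, with the host and credentials as placeholders:

[core]
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres-host:5432/airflow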