I'm learning Airflow and have a simple question. Below is my DAG, called dog_retriever:
import airflow
from airflow import DAG
from airflow.operators.http_operator import SimpleHttpOperator
I'm adding this answer primarily for anyone who is trying to (or who wants to) call an Airflow workflow DAG from a process and to receive any data that results from the DAG's activity.
It is important to understand that an HTTP POST is required to run a DAG, and that the response to that POST is hard-coded in Airflow; i.e., without changes to the Airflow code itself, Airflow will never return anything but a status code and message to the requesting process.
Airflow seems to be used primarily to create data pipelines for ETL (extract, transform, load) workflows. The existing Airflow operators, e.g. SimpleHttpOperator, can get data from RESTful web services, process it, and write it to databases using other operators, but they do not return it in the response to the HTTP POST that runs the workflow DAG.
Even if the operators did return this data in the response, the Airflow source code confirms that the trigger_dag() method doesn't check for or return it:
airflow/www/api/experimental/endpoints.py
airflow/api/client/json_client.py
All it returns is this confirmation message:
Airflow DagRun Message Received in Orchestration Service
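For example, here is a minimal sketch of that POST using Python's requests library, assuming an Airflow 1.x webserver on localhost:8080 with the experimental API enabled (the URL, port, and DAG ID are assumptions for illustration; adjust them for your deployment):

import requests

# Trigger a run of the dog_retriever DAG via the experimental REST API
resp = requests.post(
    'http://localhost:8080/api/experimental/dags/dog_retriever/dag_runs',
    json={'conf': {}})

print(resp.status_code)  # 200, even though a DagRun resource was created
print(resp.text)         # only the hard-coded "Created ..." message, never DAG output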
Since Airflow is open source, I suppose we could modify the trigger_dag() method to return the data, but then we'd be stuck maintaining the forked codebase, and we wouldn't be able to use cloud-hosted, Airflow-based services like Cloud Composer on Google Cloud Platform, because they wouldn't include our modification.
Worse, Apache Airflow isn’t even returning its hard-coded status message correctly.
When we POST successfully to the Airflow /dags/{DAG-ID}/dag_runs endpoint, we receive a "200 OK" response, not a "201 Created" response as we should. And Airflow hard-codes the content body of the response with its "Created …" status message. The standard, however, is to return the URI of the newly created resource in the response header, not in the body, which would leave the body free to return any data produced or aggregated during (or resulting from) this creation.
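For contrast, here is a rough sketch of what a standards-compliant response could look like (the run ID is hypothetical and the bodies are paraphrased):

HTTP/1.1 201 Created
Location: /api/experimental/dags/dog_retriever/dag_runs/manual__2019-01-01T00:00:00

{ ...data produced by the DAG run could go here... }

versus what Airflow actually sends:

HTTP/1.1 200 OK

{"message": "Created <DagRun dog_retriever @ ... >"}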
I attribute this flaw to the "blind" (or what I call "naive") Agile/MVP-driven approach, which only adds features that are asked for rather than remaining aware of, and leaving room for, more general utility. Since Airflow is overwhelmingly used to create data pipelines for (and by) data scientists (not software engineers), its operators can share data with each other using the proprietary, internal XCom feature, as @Chengzhi's helpful answer points out (thank you!), but they cannot under any circumstances return data to the requester that kicked off the DAG. That is, a SimpleHttpOperator can retrieve data from a third-party RESTful service and share that data with a PythonOperator (via XCom) that enriches, aggregates, and/or transforms it. The PythonOperator can then share its data with a PostgresOperator that stores the result directly in a database. But the result can never be returned to the process that requested the work be done, i.e. our orchestration service, making Airflow useless for any use case but the ones driven by its current users.
The takeaways here (for me at least) are 1) never to attribute too much expertise to anyone or to any organization. Apache is an important organization with deep and vital roots in software development … but they’re not perfect. And 2) always beware of internal, proprietary solutions. Open, standards-based solutions have been examined and vetted from many different perspectives, not just one.
I lost nearly a week chasing down different ways to do what seemed a very simple and reasonable thing. I hope that this answer will save someone else some time.
Since this is a SimpleHttpOperator, the actual JSON is pushed to XCom, and you can get it from there. Here is the line of code for that action: https://github.com/apache/incubator-airflow/blob/master/airflow/operators/http_operator.py#L87
What you need to do is set xcom_push=True, so your first t1 will be the following:
t1 = SimpleHttpOperator(
    task_id='get_labrador',
    method='GET',
    http_conn_id='http_default',
    endpoint='api/breed/labrador/images',
    headers={"Content-Type": "application/json"},
    xcom_push=True,
    dag=dag)
You should be able to find all the JSON in XCom under the return_value key; more detail on XCom can be found at: https://airflow.incubator.apache.org/concepts.html#xcoms
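As a minimal sketch of consuming that value downstream (the task ID and callable name here are just examples, written against the Airflow 1.x API), a PythonOperator can pull what t1 pushed:

from airflow.operators.python_operator import PythonOperator

def print_labrador(**context):
    # xcom_pull with no explicit key fetches what t1 pushed under 'return_value'
    data = context['ti'].xcom_pull(task_ids='get_labrador')
    print(data)

t2 = PythonOperator(
    task_id='print_labrador',
    python_callable=print_labrador,
    provide_context=True,  # required on Airflow 1.x to receive **context
    dag=dag)

t1 >> t2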