Question
I have been learning Airflow and writing DAGs for an ETL pipeline. It involves the AWS environment (S3, Redshift): data is copied from one bucket to another after being loaded into Redshift. I am storing bucket names and prefixes as Variables in Airflow, which means opening the GUI and adding them manually.
Which is the safest and most widely used practice in the industry out of the following options?
- Can we use `airflow.cfg` to store our variables (bucket names) and access them in our DAGs?
- Use a custom configuration file and parse its contents with `configparser`
- Use the GUI to add variables
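For the second option in the list, a minimal sketch of what parsing a custom configuration file with `configparser` could look like. The section and key names (`[s3]`, `source_bucket`, etc.) are made up for illustration, not part of the question:

```python
import configparser

# Hypothetical contents of a pipeline config file (e.g. pipeline.cfg);
# in practice you would call parser.read("pipeline.cfg") instead.
CONFIG_TEXT = """
[s3]
source_bucket = my-etl-raw
dest_bucket = my-etl-processed
prefix = daily/
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Look up the bucket names and prefix for use inside a DAG definition.
source_bucket = parser["s3"]["source_bucket"]
dest_bucket = parser["s3"]["dest_bucket"]
prefix = parser["s3"]["prefix"]
```

Such a file can be version-controlled alongside the DAGs, but note it is plain text, so it is only suitable for non-secret values like bucket names.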
Answer 1:
To summarize: you can use the Airflow CLI to import variables from a JSON file with `airflow variables -i <filepath>` [1], either manually or from a CI/CD pipeline. That handles the insert/update case. For deletion, you can call `airflow variables -x <key>` explicitly; as far as I know, there is currently no batch delete in Airflow.
You can have a JSON file in the following key-value format:

```json
{
    "foo1": "bar1",
    "foo2": "bar2"
}
```
One thing to note: variables behave as a key-value store, so make sure the file you import has no duplicate keys; otherwise a later value silently overrides an earlier one, with unexpected results.
[1] airflow.apache.org/cli.html#variables
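The duplicate-key pitfall above can be caught before importing. A sketch of a stdlib-only pre-check (the function name and error message are my own, not an Airflow API): `json.loads` silently keeps the *last* value for a duplicated key, so we intercept the raw key/value pairs with `object_pairs_hook`:

```python
import json

def load_variables(text):
    """Parse an Airflow-style variables JSON document, rejecting duplicate keys.

    json.loads keeps only the last value for a repeated key, which is exactly
    the silent override warned about above, so we inspect the raw pairs.
    """
    def reject_dupes(pairs):
        out = {}
        for key, value in pairs:
            if key in out:
                raise ValueError(f"duplicate variable key: {key!r}")
            out[key] = value
        return out

    return json.loads(text, object_pairs_hook=reject_dupes)

variables = load_variables('{"foo1": "bar1", "foo2": "bar2"}')
```

Running this over a variables file before handing it to the CLI import turns a silent override into a hard failure.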
Answer 2:
Airflow uses SQLAlchemy models for entities like `Connection`, `Variable`, `Pool`, etc. Furthermore, it doesn't try to hide that from the end user in any way, meaning that you are free to manipulate these entities through the underlying SQLAlchemy machinery.
If you intend to modify Variables programmatically (from within an Airflow task), take inspiration from here.
Other helpful links for reference
- bin/cli.py
- experimental/pool.py
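A sketch of the CRUD pattern this answer describes. To keep it self-contained it uses a simplified stand-in model against an in-memory SQLite database; with real Airflow you would instead import `airflow.models.Variable` and obtain a session from `airflow.settings.Session()`, and the real model also encrypts the value column:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Variable(Base):
    # Simplified stand-in for airflow.models.Variable (key/val columns only).
    __tablename__ = "variable"
    id = Column(Integer, primary_key=True)
    key = Column(String(250), unique=True)
    val = Column(String)

# In-memory SQLite instead of Airflow's metadata database.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

def set_variable(session, key, value):
    # Upsert: update the row if the key exists, otherwise insert a new one.
    var = session.query(Variable).filter_by(key=key).one_or_none()
    if var is None:
        var = Variable(key=key)
        session.add(var)
    var.val = value
    session.commit()

def get_variable(session, key):
    var = session.query(Variable).filter_by(key=key).one_or_none()
    return None if var is None else var.val

def delete_variable(session, key):
    # Returns the number of rows deleted (0 if the key was absent).
    deleted = session.query(Variable).filter_by(key=key).delete()
    session.commit()
    return deleted
```

Usage mirrors what the CLI does under the hood:

```python
with Session(engine) as session:
    set_variable(session, "source_bucket", "my-etl-raw")
    set_variable(session, "source_bucket", "my-etl-raw-v2")  # update in place
    delete_variable(session, "source_bucket")
```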
Source: https://stackoverflow.com/questions/57468115/how-to-create-update-and-delete-airflow-variables-without-using-the-gui