Airflow S3 connection using UI

梦毁少年i 2020-11-30 23:16

I've been trying to use Airflow to schedule a DAG. One of the DAGs includes a task which loads data from an S3 bucket.

For the purpose above, I need to set up an S3 connection.

8 Answers
  • 2020-11-30 23:20
    Conn Id: example_s3_connnection
    Conn Type: S3
    Extra:{"aws_access_key_id":"xxxxxxxxxx", "aws_secret_access_key": "yyyyyyyyyyy"}
    

    Note: Login and Password fields are left empty.
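
    To check that the connection actually works, here is a minimal sketch (assuming Airflow 1.10+, where S3Hook takes aws_conn_id; the bucket and key names are placeholders):

    from airflow.hooks.S3_hook import S3Hook

    # Uses the connection defined above; bucket/key are illustrative only
    hook = S3Hook(aws_conn_id='example_s3_connnection')
    print(hook.check_for_key('some/key.csv', bucket_name='my-bucket'))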

  • 2020-11-30 23:22

    We added this to our docs a few versions ago:

    http://airflow.apache.org/docs/stable/howto/connection/aws.html

    There is no difference between an AWS connection and an S3 connection.

    The accepted answer here has key and secret in the extra/JSON, and while that still works (as of 1.10.10) it is not recommended anymore as it displays the secret in plain text in the UI.
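
    For reference, a rough sketch of creating such a connection programmatically with the key pair in the Login/Password fields, so the secret stays masked in the UI (assumes Airflow 1.10+; the conn_id and credentials are placeholders):

    from airflow import settings
    from airflow.models import Connection

    # Access key id goes in Login, secret key in Password; Extra stays empty
    conn = Connection(
        conn_id='aws_default',
        conn_type='aws',
        login='<your_access_key_id>',
        password='<your_secret_access_key>')

    session = settings.Session()
    session.add(conn)
    session.commit()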

  • 2020-11-30 23:23

    If you are worried about exposing the credentials in the UI, another way is to pass the credentials file location in the Extra param in the UI. Only the functional user has read privileges to the file. It looks something like this:

    Extra:  {
        "profile": "<profile_name>", 
        "s3_config_file": "/home/<functional_user>/creds/s3_credentials", 
        "s3_config_format": "aws" }
    

    file "/home/<functional_user>/creds/s3_credentials" has below entries

    [<profile_name>]
    aws_access_key_id = <access_key_id>
    aws_secret_access_key = <secret_key>
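
    As a quick, purely illustrative sanity check (the path and profile name are the placeholders from above), you can confirm the file parses into the expected profile:

    import configparser

    # The credentials file is standard INI, one section per profile
    config = configparser.ConfigParser()
    config.read('/home/<functional_user>/creds/s3_credentials')
    print(config['<profile_name>']['aws_access_key_id'])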
    
  • 2020-11-30 23:23

    For AWS in China, it doesn't work with airflow==1.8.0; you need to update to 1.9.0, but note that from 1.9.0 the package is named apache-airflow==1.9.0.

  • For newer versions, change the Python code in the accepted answer's sample from

    s3_conn_id='my_conn_S3'
    

    to

    aws_conn_id='my_conn_s3'
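
    For example, the sensor from the accepted answer would look roughly like this on newer versions (around 1.10+); treat this as a sketch, with the import path and ids taken from the answers here:

    from airflow.sensors.s3_key_sensor import S3KeySensor

    # Same sensor as in the accepted answer, but with aws_conn_id;
    # assumes a `dag` object as defined in the full example below
    sensor = S3KeySensor(
        task_id='check_s3_for_file_in_s3',
        bucket_key='file-to-watch-*',
        wildcard_match=True,
        bucket_name='S3-Bucket-To-Watch',
        aws_conn_id='my_conn_s3',
        timeout=18 * 60 * 60,
        poke_interval=120,
        dag=dag)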
    
  • 2020-11-30 23:39

    EDIT: This answer stores your secret key in plain text, which can be a security risk and is not recommended. The best way is to put the access key and secret key in the Login/Password fields, as mentioned in the other answers. END EDIT

    It's hard to find references, but after digging a bit I was able to make it work.

    TLDR

    Create a new connection with the following attributes:

    Conn Id: my_conn_S3

    Conn Type: S3

    Extra:

    {"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}
    

    Long version, setting up UI connection:

    • On Airflow UI, go to Admin > Connections
    • Create a new connection with the following attributes:
    • Conn Id: my_conn_S3
    • Conn Type: S3
    • Extra: {"aws_access_key_id":"_your_aws_access_key_id_", "aws_secret_access_key": "_your_aws_secret_access_key_"}
    • Leave all the other fields (Host, Schema, Login) blank.

    To use this connection, below you can find a simple S3 sensor test. The idea of this test is to set up a sensor that watches for files in S3 (the 'check_s3_for_file_in_s3' task) and, once the condition below is satisfied, trigger a bash command (the 'bash_test' task).

    Testing

    • Before running the DAG, ensure you have an S3 bucket named 'S3-Bucket-To-Watch'.
    • Add the s3_dag_test.py below to your Airflow DAGs folder (~/airflow/dags).
    • Start airflow webserver.
    • Go to Airflow UI (http://localhost:8383/)
    • Start airflow scheduler.
    • Turn on 's3_dag_test' DAG on the main DAGs view.
    • Select 's3_dag_test' to show the dag details.
    • On the Graph View you should be able to see its current state.
    • 'check_s3_for_file_in_s3' task should be active and running.
    • Now, add a file named 'file-to-watch-1' to your 'S3-Bucket-To-Watch'.
    • The first task should have completed, and the second should start and finish.

    The schedule_interval in the dag definition is set to '@once', to facilitate debugging.

    To run it again, leave everything as it is, remove the files in the bucket and try again by selecting the first task (in the Graph View) and choosing 'Clear' with all of 'Past', 'Future', 'Upstream' and 'Downstream' selected. This should kick off the DAG again.

    Let me know how it went.

    s3_dag_test.py:

    """
    S3 Sensor Connection Test
    """
    
    from airflow import DAG
    # Airflow 1.x imports (only the operators actually used)
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.sensors import S3KeySensor
    from datetime import datetime, timedelta
    
    default_args = {
        'owner': 'airflow',
        'depends_on_past': False,
        'start_date': datetime(2016, 11, 1),
        'email': ['something@here.com'],
        'email_on_failure': False,
        'email_on_retry': False,
        'retries': 5,
        'retry_delay': timedelta(minutes=5)
    }
    
    dag = DAG('s3_dag_test', default_args=default_args, schedule_interval= '@once')
    
    # Downstream task: runs only after the sensor below has found the file
    t1 = BashOperator(
        task_id='bash_test',
        bash_command='echo "hello, it should work" > s3_conn_test.txt',
        dag=dag)
    
    # Sensor that polls S3 every 2 minutes for keys matching the wildcard,
    # using the 'my_conn_S3' connection created in the UI
    sensor = S3KeySensor(
        task_id='check_s3_for_file_in_s3',
        bucket_key='file-to-watch-*',
        wildcard_match=True,
        bucket_name='S3-Bucket-To-Watch',
        s3_conn_id='my_conn_S3',
        timeout=18*60*60,
        poke_interval=120,
        dag=dag)
    
    t1.set_upstream(sensor)
    

    Main References:
    • https://gitter.im/apache/incubator-airflow
    • https://groups.google.com/forum/#!topic/airbnb_airflow/TXsJNOBBfig
    • https://github.com/apache/incubator-airflow