How to store SQL output in a pandas DataFrame using Airflow?

执念已碎 2021-01-15 15:53

I want to read data from SQL into a pandas DataFrame, do some data transformations, and then load the result into another table using Airflow.

Issue that I am facing is that conne

1 answer
  • 2021-01-15 16:29

    I doubt there's a built-in operator for this, but you can easily write a custom operator:

    • Extend PostgresOperator, BaseOperator, or any other operator of your choice. All custom code goes into the overridden execute() method.
    • Use PostgresHook to obtain a pandas DataFrame by calling its get_pandas_df() method.
    • Perform whatever transformations you need on the DataFrame.
    • Finally, call the hook's insert_rows() method to insert the data into the destination table.

    UPDATE-1

    As requested, here is the code for the operator:

    from typing import Any, Dict, List, Tuple
    
    # Airflow 1.10-style import paths; in Airflow 2+ the hook and operator
    # live in the postgres provider package (airflow.providers.postgres)
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.operators.postgres_operator import PostgresOperator
    from airflow.utils.decorators import apply_defaults
    from pandas import DataFrame
    
    
    class MyCustomOperator(PostgresOperator):
    
        @apply_defaults
        def __init__(self, destination_table: str, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.destination_table: str = destination_table
    
        def execute(self, context: Dict[str, Any]):
            # create PostgresHook
            self.hook: PostgresHook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                                   schema=self.database)
            # read data from Postgres-SQL query into pandas DataFrame
            df: DataFrame = self.hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
            # perform transformations on df here
            df['column_to_be_doubled'] = df['column_to_be_doubled'].multiply(2)
            # ... (any further transformations)
            # convert pandas DataFrame into list of tuples
            rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
            # insert list of tuples in destination Postgres table
            self.hook.insert_rows(table=self.destination_table, rows=rows)
    

    Note: The snippet is for reference only; it has NOT been tested

    References

    • Pandas convert DataFrame into Array of tuples
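
    The DataFrame-to-tuples conversion used in the operator can be tried on its own, without Airflow or a database. The sketch below uses a made-up two-column frame (the column name id and the sample values are assumptions for illustration only) and shows the shape of the rows that would be handed to insert_rows():

```python
import pandas as pd

# Hypothetical sample data standing in for the result of get_pandas_df()
df = pd.DataFrame({"id": [1, 2], "column_to_be_doubled": [10, 20]})

# Same placeholder transformation as in the operator above
df["column_to_be_doubled"] = df["column_to_be_doubled"].multiply(2)

# index=False drops the DataFrame index; name=None yields plain tuples
rows = list(df.itertuples(index=False, name=None))

# Column names in the same order as the tuple fields
target_fields = list(df.columns)

print(rows)           # [(1, 20), (2, 40)]
print(target_fields)  # ['id', 'column_to_be_doubled']
```

    If the destination table's column order differs from the DataFrame's, passing target_fields to insert_rows() is one way to keep the values aligned with the right columns.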

    Further modifications / improvements

    • The destination_table param can be read from an Airflow Variable.
    • If the destination table does not reside in the same Postgres schema, __init__ can take another param such as destination_postgres_conn_id, which is then used to create a destination_hook on which to invoke the insert_rows() method.
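
    The second improvement could look roughly like the untested sketch below. The helper name copy_with_transform is hypothetical, and both hooks are only assumed to expose get_pandas_df() and insert_rows() the way PostgresHook does, which keeps the helper independent of Airflow imports so it can be exercised with stand-in objects:

```python
from typing import Any, List, Tuple

import pandas as pd


def copy_with_transform(source_hook: Any, destination_hook: Any,
                        sql: str, destination_table: str) -> int:
    """Read with source_hook, transform in pandas, write with destination_hook.

    Hypothetical helper: both arguments are assumed to behave like
    PostgresHook (get_pandas_df / insert_rows). Returns the row count.
    """
    # read the query result into a DataFrame via the source connection
    df: pd.DataFrame = source_hook.get_pandas_df(sql=sql)
    # placeholder transformation, same as in the operator above
    df["column_to_be_doubled"] = df["column_to_be_doubled"].multiply(2)
    # convert the DataFrame into a list of plain tuples
    rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
    # write through the second hook, naming the columns explicitly
    destination_hook.insert_rows(table=destination_table, rows=rows,
                                 target_fields=list(df.columns))
    return len(rows)


# Inside the operator's execute(), the two hooks might be built like this
# (destination_postgres_conn_id being the extra __init__ param):
#   source_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
#   destination_hook = PostgresHook(postgres_conn_id=self.destination_postgres_conn_id)
#   copy_with_transform(source_hook, destination_hook, self.sql, self.destination_table)
```

    Keeping the read and write hooks as separate arguments means the source and destination can point at different connections without changing the transformation logic.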