I want to read data from SQL into a pandas DataFrame, apply some data transformations, and then load the result into another table using Airflow.
Issue that I am facing is that conne
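I doubt there's a built-in operator for this. You can easily write a custom operator:
Extend PostgresOperator, or just BaseOperator / any other operator of your choice. All custom code goes into the overridden execute() method.
Then use PostgresHook to obtain a Pandas DataFrame by invoking the get_pandas_df() function.
Perform whatever transformations you need on the pandas df.
Finally, use the insert_rows() function to insert the data into the destination table.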
UPDATE-1
As requested, I'm hereby adding the code for the operator:
from typing import Dict, Any, List, Tuple

# Airflow 1.10-style import paths
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.decorators import apply_defaults
from pandas import DataFrame


class MyCustomOperator(PostgresOperator):

    @apply_defaults
    def __init__(self, destination_table: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_table: str = destination_table

    def execute(self, context: Dict[str, Any]):
        # create PostgresHook
        self.hook: PostgresHook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                               schema=self.database)
        # read data from Postgres-SQL query into pandas DataFrame
        df: DataFrame = self.hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # perform transformations on df here
        df['column_to_be_doubled'] = df['column_to_be_doubled'].multiply(2)
        # ... (any further transformations go here)
        # convert pandas DataFrame into list of tuples
        rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
        # insert list of tuples in destination Postgres table
        self.hook.insert_rows(table=self.destination_table, rows=rows)
Note: The snippet is for reference only; it has NOT been tested
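Since the operator itself is untested, it may help to check the transformation and the DataFrame-to-tuples conversion locally, without Airflow or a database. The following is only a sketch: the sample data and the helper names transform and to_rows are made up for illustration and are not part of the operator above.

```python
from typing import Any, List, Tuple

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Double one column, mirroring the transformation used in the operator."""
    out = df.copy()  # work on a copy so the input frame is left untouched
    out["column_to_be_doubled"] = out["column_to_be_doubled"].multiply(2)
    return out


def to_rows(df: pd.DataFrame) -> List[Tuple[Any, ...]]:
    """Convert a DataFrame into one plain tuple per row, without the index."""
    return list(df.itertuples(index=False, name=None))


# hypothetical sample data standing in for the result of the SQL query
sample = pd.DataFrame({"id": [1, 2], "column_to_be_doubled": [10, 20]})
rows = to_rows(transform(sample))
print(rows)  # [(1, 20), (2, 40)]
```

The resulting list of tuples has the same shape as the rows argument passed to insert_rows in the operator.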
Further modifications / improvements
The destination_table param can be read from a Variable.
If the destination table does not reside in the same Postgres schema, then we can take another param like destination_postgres_conn_id in __init__ and use that to create a destination_hook on which we can invoke the insert_rows method.
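One possible way to structure the separate destination hook is to move the read / transform / insert logic into a helper that receives both hooks as arguments. This is a rough sketch, not tested against Airflow: copy_with_transform, FakeSourceHook and FakeDestinationHook are hypothetical names, and the fake hooks only imitate the get_pandas_df and insert_rows methods so the helper can be run locally.

```python
from typing import Any, Dict, List, Optional, Tuple

import pandas as pd


def copy_with_transform(source_hook, destination_hook, sql: str,
                        destination_table: str,
                        parameters: Optional[Dict[str, Any]] = None) -> int:
    """Read via source_hook, transform, then insert via destination_hook."""
    df = source_hook.get_pandas_df(sql=sql, parameters=parameters)
    df["column_to_be_doubled"] = df["column_to_be_doubled"].multiply(2)
    rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
    destination_hook.insert_rows(table=destination_table, rows=rows)
    return len(rows)


class FakeSourceHook:
    """Hypothetical stand-in for a hook that reads from the source database."""

    def get_pandas_df(self, sql, parameters=None):
        return pd.DataFrame({"id": [1, 2], "column_to_be_doubled": [10, 20]})


class FakeDestinationHook:
    """Hypothetical stand-in that records inserted rows instead of writing them."""

    def __init__(self):
        self.inserted = {}

    def insert_rows(self, table, rows, **kwargs):
        self.inserted.setdefault(table, []).extend(rows)


destination = FakeDestinationHook()
count = copy_with_transform(FakeSourceHook(), destination,
                            sql="SELECT id, column_to_be_doubled FROM some_table",
                            destination_table="my_destination_table")
```

Inside execute(), the two arguments would presumably be PostgresHook(postgres_conn_id=self.postgres_conn_id) and PostgresHook(postgres_conn_id=self.destination_postgres_conn_id), and destination_table could come from Variable.get() with a key of your choosing.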