Snowflake pandas pd_writer writes out tables with NULLs

Posted by 久未见 on 2020-12-26 06:45:42

Question


I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is also fine for smaller dataframes. However, some dataframes are 500k+ rows, and at 15k records per chunk, it takes forever to finish writing to Snowflake.
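For reference, the slow-but-working version looks roughly like this (a sketch; the table name is a placeholder, and engine is the SQLAlchemy engine shown in the EDIT below):

# Baseline: plain to_sql with chunked inserts -- correct, but very slow for 500k+ rows.
df.to_sql(
    "huge_table",        # placeholder table name
    engine,
    if_exists="append",
    index=False,
    chunksize=15000,     # insert 15k records per round trip
)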

I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.

I thought it was a NaN-to-NULL issue and tried everything possible to replace the NaNs with None, and while the replacement does happen within the dataframe, by the time it gets to the table, everything becomes NULL.
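For reference, the replacement I tried looked roughly like this (one common idiom; it did change the values inside the dataframe, but the written table still came out all-NULL):

# Cast to object dtype, then swap every NaN for None.
df = df.astype(object).where(df.notna(), None)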

How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?

EDIT: Following Chris' answer, I decided to try with the official example. Here's my code and the result set:

import os
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()

    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )

    except ConnectionError:
        print("Unable to connect to database!")

    finally:
        connection.close()
        engine.dispose()

    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])

print(df.head())

create_table(df, "dummy_demo_table")

The code runs with no hitches, but when I look at the table that gets created, it's all NULLs. Again.


Answer 1:


Turns out, the documentation (arguably Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it takes is a single upper-case character in the column name and it works perfectly.

My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
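Applied to the example from the question, the whole fix is just this (a sketch; create_table is the helper defined above):

df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
df.columns = map(str.upper, df.columns)   # "name" -> "NAME", "balance" -> "BALANCE"
create_table(df, "dummy_demo_table")      # values now arrive intact instead of as NULLs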




Answer 2:


I have had this exact same issue; don't despair, there is a solution in sight. When you create a table in Snowflake from the Snowflake worksheet or Snowflake environment, it names the object and all columns and constraints in uppercase. However, when you create the table from Python using the dataframe, the object gets created in the exact case that you specified in your dataframe, in your case columns=['name', 'balance']. So when the insert happens, it looks for all-uppercase column names in Snowflake, cannot find them, does the insert anyway, and sets your two columns to NULL, since the columns were created as nullable.

The best way to get past this issue is to create your columns in uppercase in the dataframe: columns=['NAME', 'BALANCE'].

I do think this is something Snowflake should address and fix, as it is not expected behavior.

Even if you tried to select from your table that has the NULLs, e.g.:

select name, balance from dummy_demo_table

you would probably get an error like the following:

SQL compilation error: error line 1 at position 7 invalid identifier 'name'

This is because unquoted identifiers in Snowflake fold to uppercase, while quoted identifiers keep their case, so the lowercase column names created from the dataframe cannot be found. BUT the following will work:

SELECT * from dummy_demo_table




Answer 3:


I assume you've already looked at the documentation and seen this example code:

import pandas
from snowflake.connector.pandas_tools import pd_writer

# Create a DataFrame containing data about customers
df = pandas.DataFrame([('Mark', 10), ('Luke', 20)], columns=['name', 'balance'])

# Specify that the to_sql method should use the pd_writer function
# to write the data from the DataFrame to the table named "customers"
# in the Snowflake database.
df.to_sql('customers', engine, index=False, method=pd_writer)

I don't think we can help you beyond this without a sample of your Python code with some matching dummy data so that we can investigate further. My advice would be to first get the sample above working, then test the process on smaller dataframes that don't include any NaNs, then grow from there.
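For the "viable alternatives" part of the question: write_pandas, which the question's code already imports but never calls, bypasses to_sql entirely and bulk-loads the frame by staging it and running COPY INTO. A minimal sketch, with placeholder credentials, assuming the target table already exists (write_pandas does not create it) and keeping in mind that the same uppercase-identifier caveat applies:

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# Raw connector connection (placeholder credentials).
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    database="dummy_db",
    schema="dummy_schema",
    warehouse="my_warehouse",
)

# Stages the dataframe as files and COPYs them into the existing table.
success, num_chunks, num_rows, _ = write_pandas(conn, df, "DUMMY_DEMO_TABLE")
conn.close()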



Source: https://stackoverflow.com/questions/63370701/snowflake-pandas-pd-writer-writes-out-tables-with-nulls
