Speed up to_sql() when writing Pandas DataFrame to Oracle database using SqlAlchemy and cx_Oracle

Asked by 说谎 on 2020-12-01 09:04

Using the pandas DataFrame's to_sql() method, I can write a small number of rows to a table in an Oracle database pretty easily:

from sqlalchemy import create_engine         
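
(The rest of the snippet was truncated; a minimal sketch of the pattern in question follows, with a placeholder DSN and a made-up table name that are not from the original question:)

from sqlalchemy import create_engine
import pandas as pd

# placeholder connection string -- substitute real user/password/service name
engine = create_engine('oracle+cx_oracle://user:password@tnsname')

df = pd.DataFrame({'id': [1, 2], 'name': ['foo', 'bar']})

# fine for a handful of rows, but very slow for large DataFrames,
# because object columns end up as CLOB (see the answers below)
df.to_sql('test_table', engine, index=False, if_exists='replace')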


        
2 Answers
  • 2020-12-01 09:28

    Just commenting here for posterity. I'm on Python 3.6.8, pandas 1.1.3, SQLAlchemy 1.3.20. When I tried implementing the solution from MaxU, I initially ran into an error:

    raise ValueError(f"{col} ({my_type}) not a string")
    

    I honestly don't know why. After spending a couple of hours debugging, this is what finally worked for me. In my case, I was trying to read from a CSV and insert into Oracle:

    import cx_Oracle
    import pandas as pd
    import sqlalchemy as sa
    from sqlalchemy import create_engine

    # USERNAME, PASSWORD and DATABASE are placeholders for real credentials
    conn = create_engine('oracle://{}:{}@{}'.format(USERNAME, PASSWORD, DATABASE))

    df = pd.read_csv(...)

    # size each object (string) column as VARCHAR(max length) so pandas
    # doesn't create it as CLOB
    object_columns = df.columns[df.dtypes == 'object'].tolist()
    dtyp = {c: sa.types.VARCHAR(df[c].str.len().max()) for c in object_columns}

    df.to_sql(..., dtype=dtyp)
    

    To be honest, I didn't really change much, so I'm not 100% sure why I was getting the original error, but I'm posting this here in case it's helpful. (My best guess in hindsight: that exact ValueError is raised by pandas' DBAPI fallback path, which kicks in when to_sql is given a raw connection instead of a SQLAlchemy connectable, and which only accepts type-name strings, not SQLAlchemy type objects, in dtype.)
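
    One more edge case worth guarding against (my addition, not something I actually hit): df[c].str.len().max() returns NaN for an object column that is entirely null, and VARCHAR(nan) will then blow up. A small defensive sketch, assuming an arbitrary fallback length of 255:

    import pandas as pd
    import sqlalchemy as sa

    def varchar_dtypes(df, fallback_len=255):
        """Build a {column: VARCHAR(n)} mapping for all object columns.

        fallback_len (an assumption -- tune as needed) is used when a
        column is entirely null, since .str.len().max() returns NaN
        in that case.
        """
        dtyp = {}
        for c in df.columns[df.dtypes == 'object']:
            max_len = df[c].str.len().max()
            dtyp[c] = sa.types.VARCHAR(
                fallback_len if pd.isna(max_len) else int(max_len)
            )
        return dtyp

    # usage: df.to_sql('my_table', conn, index=False, dtype=varchar_dtypes(df))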

  • 2020-12-01 09:33

    By default, Pandas + SQLAlchemy saves all object (string) columns as CLOB in an Oracle DB, which makes insertion extremely slow.

    Here are some tests:

    import pandas as pd
    import cx_Oracle
    from sqlalchemy import types, create_engine
    
    #######################################################
    ### DB connection strings config
    #######################################################
    tns = """
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = my-db-scan)(PORT = 1521))
        (CONNECT_DATA =
          (SERVER = DEDICATED)
          (SERVICE_NAME = my_service_name)
        )
      )
    """
    
    usr = "test"
    pwd = "my_oracle_password"
    
    engine = create_engine('oracle+cx_oracle://%s:%s@%s' % (usr, pwd, tns))
    
    # sample DF [shape: `(2000, 11)`]
    # I took your 2-row DF and replicated it: `df = pd.concat([df] * 10**3, ignore_index=True)`
    df = pd.read_csv('/path/to/file.csv')
    

    DF info:

    In [61]: df.shape
    Out[61]: (2000, 11)
    
    In [62]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2000 entries, 0 to 1999
    Data columns (total 11 columns):
    id               2000 non-null int64
    name             2000 non-null object
    premium          2000 non-null float64
    created_date     2000 non-null datetime64[ns]
    init_p           2000 non-null float64
    term_number      2000 non-null int64
    uprate           1000 non-null float64
    value            2000 non-null int64
    score            2000 non-null float64
    group            2000 non-null int64
    action_reason    2000 non-null object
    dtypes: datetime64[ns](1), float64(4), int64(4), object(2)
    memory usage: 172.0+ KB
    

    Let's check how long it takes to store it in the Oracle DB:

    In [57]: df.shape
    Out[57]: (2000, 11)
    
    In [58]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace')
    1 loop, best of 1: 16 s per loop
    

    In the Oracle DB (note the CLOBs):

    AAA> desc test.test_table
     Name                            Null?    Type
     ------------------------------- -------- ------------------
     ID                                       NUMBER(19)
     NAME                                     CLOB        #  !!!
     PREMIUM                                  FLOAT(126)
     CREATED_DATE                             DATE
     INIT_P                                   FLOAT(126)
     TERM_NUMBER                              NUMBER(19)
     UPRATE                                   FLOAT(126)
     VALUE                                    NUMBER(19)
     SCORE                                    FLOAT(126)
     group                                    NUMBER(19)
     ACTION_REASON                            CLOB        #  !!!
    

    Now let's instruct pandas to save all object columns as VARCHAR data types:

    In [59]: dtyp = {c:types.VARCHAR(df[c].str.len().max())
        ...:         for c in df.columns[df.dtypes == 'object'].tolist()}
        ...:
    
    In [60]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp)
    1 loop, best of 1: 335 ms per loop
    

    This time it was approx. 48 times faster (16 s vs. 335 ms).

    Check in Oracle DB:

     AAA> desc test.test_table
     Name                          Null?    Type
     ----------------------------- -------- ---------------------
     ID                                     NUMBER(19)
     NAME                                   VARCHAR2(13 CHAR)        #  !!!
     PREMIUM                                FLOAT(126)
     CREATED_DATE                           DATE
     INIT_P                                 FLOAT(126)
     TERM_NUMBER                            NUMBER(19)
     UPRATE                                 FLOAT(126)
     VALUE                                  NUMBER(19)
     SCORE                                  FLOAT(126)
     group                                  NUMBER(19)
     ACTION_REASON                          VARCHAR2(8 CHAR)        #  !!!
    

    Let's test it with a 200,000-row DF:

    In [69]: df.shape
    Out[69]: (200000, 11)
    
    In [70]: %timeit -n 1 -r 1 df.to_sql('test_table', engine, index=False, if_exists='replace', dtype=dtyp, chunksize=10**4)
    1 loop, best of 1: 4.68 s per loop
    

    It took ~5 seconds for the 200K-row DF in my test (not the fastest) environment.

    Conclusion: use the following trick to explicitly specify a dtype for all DF columns of object dtype when saving DataFrames to an Oracle DB. Otherwise they will be saved as the CLOB data type, which requires special treatment and makes insertion very slow:

    dtyp = {c: types.VARCHAR(df[c].str.len().max())
            for c in df.columns[df.dtypes == 'object'].tolist()}
    
    df.to_sql(..., dtype=dtyp)
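
    To sanity-check the resulting column types without leaving Python, one option is SQLAlchemy's inspection API (a sketch; engine and 'test_table' refer to the objects from the tests above):

    from sqlalchemy import inspect

    insp = inspect(engine)
    for col in insp.get_columns('test_table'):
        print(col['name'], col['type'])
    # NAME and ACTION_REASON should now report VARCHAR instead of CLOB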
    