pyodbc/SQLAlchemy enable fast execute many

醉梦人生 2021-01-01 04:07

In response to my question How to speed up data wrangling A LOT in Python + Pandas + sqlAlchemy + MSSQL/T-SQL I was kindly directed to Speeding up pandas.DataFrame.to_sql wi

1 answer
  • 2021-01-01 04:35

    The error you received is caused by changes introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The produced VALUES clause contains 100,000 parameter markers, and it seems that the count is stored in a signed 16-bit integer, so it overflows and you get the odd-looking message:

    The SQL contains -31072 parameter markers, but 100000 parameters were supplied

    You can check for yourself:

    In [16]: 100000 % (2 ** 16) - 2 ** 16
    Out[16]: -31072
    

    If you would like to keep using Pandas as is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum of 1,000 rows for a VALUES clause and the maximum of 2,100 parameters for stored procedures. The details are again explained in the linked Q/A.
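
As a rough illustration of that calculation, here is a minimal sketch. The helper name `safe_chunksize` and its arguments are hypothetical, not part of Pandas. It assumes the total parameter count must stay strictly below 2,100, which is consistent with the chunksize of 209 used further down for 10 columns (9 data columns plus the index that to_sql writes by default).

```python
def safe_chunksize(n_columns, include_index=True, max_params=2100, max_rows=1000):
    """Largest chunksize that respects both SQL Server limits.

    Hypothetical helper. Assumes every inserted cell becomes one
    parameter marker and that the total must stay below max_params.
    """
    # to_sql writes the index as an extra column unless index=False
    cols = n_columns + (1 if include_index else 0)
    if cols < 1:
        raise ValueError("at least one column is required")
    # rows allowed by the parameter limit (strictly below max_params)
    rows_by_params = (max_params - 1) // cols
    # also honour the row limit of a single VALUES clause
    return max(1, min(max_rows, rows_by_params))


print(safe_chunksize(9))  # 9 data columns + index
```

With a dataframe `df`, this could be used as `df.to_sql('foo', engine, chunksize=safe_chunksize(len(df.columns)))`.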

    Before the change, Pandas always used executemany() when inserting data. Newer versions detect whether the dialect in use supports a VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame since PyODBC fixed their executemany() performance, provided the right flag is enabled.

    To force Pandas to use executemany() with PyODBC again, SQLTable has to be monkeypatched:

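    import pandas.io.sql
    
    def insert_statement(self, data, conn):
        # Return a plain INSERT plus the list of row dicts, so that
        # SQLAlchemy executes it with executemany()
        return self.table.insert(), data
    
    # Replace the method that builds the multi-row VALUES statement
    pandas.io.sql.SQLTable.insert_statement = insert_statement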
    

    This will be horribly slow if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.
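
The answer does not show the handler itself, so the following is only a sketch of one way it could look, using SQLAlchemy's `before_cursor_execute` event. The function names `set_fast_executemany` and `enable_fast_executemany` are made up for illustration, and `engine` is assumed to be an existing mssql+pyodbc engine.

```python
def set_fast_executemany(conn, cursor, statement, parameters, context, executemany):
    """before_cursor_execute handler: switch on pyodbc's fast path.

    Only touches the cursor when SQLAlchemy is about to call executemany().
    """
    if executemany:
        cursor.fast_executemany = True


def enable_fast_executemany(engine):
    """Attach the handler to an existing engine (hypothetical helper)."""
    # Imported here so the handler above can be used and tested
    # without SQLAlchemy installed
    from sqlalchemy import event

    event.listen(engine, "before_cursor_execute", set_fast_executemany)
```

Calling `enable_fast_executemany(engine)` once, before `df.to_sql(...)`, should be enough, since the listener stays attached to the engine.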

    Here is a simple performance comparison, using the following dataframe:

    In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})
    

    Vanilla Pandas 0.24.0:

    In [14]: %time df.to_sql('foo', engine, chunksize=209)
    CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
    Wall time: 2min 26s
    

    Monkeypatched Pandas with fast executemany enabled:

    In [10]: %time df.to_sql('foo', engine, chunksize=500000)
    CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
    Wall time: 38 s
    