Question
My problem is essentially this: when I call to_sql with if_exists = 'append' and name set to a table that already exists on my SQL Server, Python crashes.
This is my code:
from sqlalchemy import event  # engine (the SQLAlchemy engine) and df (the DataFrame) are created earlier

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # pyodbc flag: use fast executemany for bulk inserts

df.to_sql(name='existingSQLTable', con=engine, if_exists='append', index=False, chunksize=10000, dtype=dataTypes)
I didn't include it, but dataTypes is a dictionary mapping each column name to its data type.
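For illustration, it looks something like this (the column names here are hypothetical; the real dictionary covers all 25 columns):

from sqlalchemy.types import Date, Float, String

# Hypothetical sketch of the dtype mapping passed to to_sql
dataTypes = {'id': Float(), 'date': Date(), 'state': String(2)}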
This is the error I get:
Traceback (most recent call last):
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
pyodbc.IntegrityError: ('23000', "[23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Violation of PRIMARY KEY constraint 'PK__existingSQLTable__'. Cannot insert duplicate key in object 'dbo.existingSQLTable'. The duplicate key value is (20008.7, 2008-08-07, Fl). (2627) (SQLExecute); [23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The statement has been terminated. (3621)")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    Table.to_sql(name = 'existingSQLTable', con = engine, if_exists = 'append', index = False, chunksize = 10000, dtype = dataTypes)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1165, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 571, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 1250, in to_sql
    table.insert(chunksize)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 770, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 745, in _execute_insert
    conn.execute(self.insert_statement(), data)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1146, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 202, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 185, in reraise
    raise value.with_traceback(tb)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
Based on the errors, it appears to me that something is wrong with the fast_executemany flag, but I've read a lot of documentation on it and don't see anything wrong with how I'm using it.
Things that may be of note:
- A table that does not already exist with if_exists = 'replace' works as expected
- A table that does not already exist with if_exists = 'append' works as expected
- A table that already exists with if_exists = 'replace' works as expected
- My DataFrame is about 3 million rows and 25 columns (mostly floats and some short strings)
- I can successfully write at most 900,000 rows without Python crashing.
- I'm using SQL Server, pandas 0.23.3, pyodbc 4.0.23 (I get the same error with 4.0.22), Jupyter Notebook (I've also tried IDLE with the same result), Windows 10, Python 3.5.1, and Anaconda 3.
The obvious solution to me was to break the DataFrame up into chunks of 900,000 rows. While the first chunk uploads successfully, I cannot then append even a single row to it without Python crashing.
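Roughly, the chunked upload I tried looks like this (a sketch reusing the names from above; 900,000 is just the empirical limit I found):

# Append the DataFrame in pieces of at most 900,000 rows
for start in range(0, len(df), 900000):
    df.iloc[start:start + 900000].to_sql(name='existingSQLTable', con=engine,
                                         if_exists='append', index=False,
                                         chunksize=10000, dtype=dataTypes)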
Is this error a result of the code meant to speed up the process (which it does fantastically)? Am I misunderstanding the to_sql function? Or is there something else going on? Any suggestions would be welcome! Also, if anyone has run into a similar problem, it would be great to know!
Answer 1:
As @Jon Clements explained, the problem was that some rows had identical primary keys (even though the rows themselves weren't identical). I used the pandas DataFrame.drop_duplicates method, with the subset parameter set to the primary key columns. This solved the PRIMARY KEY violation error.
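A minimal sketch, assuming the primary key spans three hypothetical columns 'id', 'date', and 'state' (substitute the actual PK columns):

# Keep the first row for each primary-key combination, then append as before
df = df.drop_duplicates(subset=['id', 'date', 'state'], keep='first')
df.to_sql(name='existingSQLTable', con=engine, if_exists='append',
          index=False, chunksize=10000, dtype=dataTypes)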
Source: https://stackoverflow.com/questions/51545670/pandas-to-sql-append-to-an-existing-table-causes-python-crash