Question
My problem is essentially this: when I call to_sql with if_exists = 'append' and name set to a table that already exists on my SQL Server, Python crashes.
This is my code:
from sqlalchemy import event  # engine (the SQLAlchemy engine) and df (the DataFrame) are created earlier

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # pyodbc flag: use fast executemany for bulk inserts

df.to_sql(name='existingSQLTable', con=engine, if_exists='append', index=False, chunksize=10000, dtype=dataTypes)
I didn't include it, but dataTypes is a dictionary mapping each column name to its data type.
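For illustration, it looks something like this (the column names here are hypothetical; the real dictionary covers all 25 columns):

from sqlalchemy.types import Date, Float, String

# Hypothetical sketch of the dtype mapping passed to to_sql
dataTypes = {'id': Float(), 'date': Date(), 'state': String(2)}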
This is the error I get:
Traceback (most recent call last):
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
pyodbc.IntegrityError: ('23000', "[23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Violation of PRIMARY KEY constraint 'PK__existingSQLTable__'. Cannot insert duplicate key in object 'dbo.existingSQLTable'. The duplicate key value is (20008.7, 2008-08-07, Fl). (2627) (SQLExecute); [23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The statement has been terminated. (3621)")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<pyshell#24>", line 1, in <module>
    Table.to_sql(name = 'existingSQLTable', con = engine, if_exists = 'append', index = False, chunksize = 10000, dtype = dataTypes)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1165, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 571, in to_sql
    chunksize=chunksize, dtype=dtype)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 1250, in to_sql
    table.insert(chunksize)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 770, in insert
    self._execute_insert(conn, keys, chunk_iter)
  File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 745, in _execute_insert
    conn.execute(self.insert_statement(), data)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1146, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 202, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 185, in reraise
    raise value.with_traceback(tb)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
    context)
  File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
    cursor.executemany(statement, parameters)
Based on the errors, it appears to me that something is wrong with the fast_executemany flag, but I've read a lot of documentation on it and don't see anything wrong with how I'm using it.
Things that may be of note:
- A table that does not already exist with if_exists = 'replace' works as expected
- A table that does not already exist with if_exists = 'append' works as expected
- A table that already exists with if_exists = 'replace' works as expected
- My DataFrame is about 3 million rows and 25 columns (mostly floats and some short strings)
- I can successfully write at most 900,000 rows without Python crashing.
- I'm using SQL Server, pandas 0.23.3, pyodbc 4.0.23 (I get the same error with 4.0.22), Jupyter Notebook (I've also tried IDLE with the same result), Windows 10, Python 3.5.1, and Anaconda 3.
The obvious solution to me was to break the DataFrame up into chunks of 900,000 rows. While the first chunk uploads successfully, I cannot then append even a single row to it without Python crashing.
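Roughly, the chunked upload I tried looks like this (a sketch reusing the names from above; 900,000 is just the empirical limit I found):

# Append the DataFrame in pieces of at most 900,000 rows
for start in range(0, len(df), 900000):
    df.iloc[start:start + 900000].to_sql(name='existingSQLTable', con=engine,
                                         if_exists='append', index=False,
                                         chunksize=10000, dtype=dataTypes)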
Is this error a result of the code meant to speed up the process (which it does fantastically)? Am I misunderstanding the to_sql function? Or is there something else going on? Any suggestions would be welcome! Also, if anyone has run into a similar problem, it would be great to know!
Answer 1:
As @Jon Clements explained, the problem was that some rows had identical primary keys (even though the rows themselves weren't identical). I used the pandas DataFrame.drop_duplicates method, with the subset parameter set to the primary key columns. This solved the PRIMARY KEY violation error.
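A minimal sketch, assuming the primary key spans three hypothetical columns 'id', 'date', and 'state' (substitute the actual PK columns):

# Keep the first row for each primary-key combination, then append as before
df = df.drop_duplicates(subset=['id', 'date', 'state'], keep='first')
df.to_sql(name='existingSQLTable', con=engine, if_exists='append',
          index=False, chunksize=10000, dtype=dataTypes)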
Source: https://stackoverflow.com/questions/51545670/pandas-to-sql-append-to-an-existing-table-causes-python-crash