Question
I'm noticing some odd behavior in the SQL generated for queries against string fields in MS SQL.
Server version: SQL Server 2014 12.0.5000.0
Collation: SQL_Latin1_General_CP1_CI_AS
Python version: 3.7
Our database has a mix of NVARCHAR (mostly newer) and VARCHAR (mostly older) fields. We are using SQLAlchemy to connect our Python application to the database, and even though we specify that a column is of type String (as opposed to Unicode), the executed SQL always comes out in NVARCHAR syntax (for example, N'foo').
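For concreteness, here is a minimal sketch of the kind of mapping and query involved (the Test model is illustrative; the connection string is a placeholder):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    key = Column(String(50))  # VARCHAR(50) on the server, declared as String

engine = create_engine('mssql+pymssql://user:pass@host/db')  # placeholder
session = sessionmaker(bind=engine)()

# Despite the String (not Unicode) declaration, the emitted SQL is
#   SELECT ... WHERE test.[key] = N'record123456'
items = session.query(Test).filter(Test.key == 'record123456').all()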
This ends up creating some obvious problems, as a simple index lookup on a multi-million row table turns into a giant string re-encoding operation.
The workaround I discovered is to pass in bytestrings (a la s.encode("utf-8")) instead of strs, but this is incredibly error-prone and hackish. I expected SQLAlchemy to handle this automatically since I told it that I'm querying against a String column and not a Unicode column.
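Continuing the illustrative sketch above, the workaround looks like this:

# Passing bytes instead of str makes the driver render a plain varchar
# literal: SELECT ... WHERE test.[key] = 'record123456' (no N prefix)
items = session.query(Test).filter(
    Test.key == 'record123456'.encode('utf-8')
).all()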
If this is supposed to happen automatically, then maybe it's because it doesn't know the database collation? If so, how would I go about setting this?
Finally, as another point of reference, we're using pymssql. I am aware, through previous experience before we were using SQLAlchemy, that pymssql does the same thing (it assumes unicode strings are NVARCHAR while bytestrings are not). Code here. As far as I can tell, SQLAlchemy just passes this off down the line. This behavior is a bit surprising to me since SQLAlchemy knows the column types and the type of connection/driver it's working with.
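To show that driver-level behavior in isolation, here is a raw-pymssql sketch (connection parameters are placeholders):

import pymssql

conn = pymssql.connect(server='host', user='user',
                       password='pass', database='db')
cur = conn.cursor()

# str parameter: pymssql interpolates it as N'record123456' (nvarchar)
cur.execute("SELECT id, [key] FROM test WHERE [key] = %s",
            ('record123456',))

# bytes parameter: pymssql interpolates it as 'record123456' (varchar)
cur.execute("SELECT id, [key] FROM test WHERE [key] = %s",
            ('record123456'.encode('utf-8'),))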
I'm not afraid to get my hands dirty, so if anyone happens to know where this could be reasonably patched, I'd be happy to contribute. My current investigation seems to indicate something to do with dialects and/or query/statement compilation.
I've uploaded a minimal example project to GitHub.
EDIT 2019-03-18: Updated with new information based on investigation.
EDIT 2019-03-23: Added GitHub repo with minimal example.
Answer 1:
I was able to reproduce the issue. Your MCVE was very helpful.
It was interesting to see that, for your ORM example, SQL Profiler showed no evidence that SQLAlchemy was retrieving the column metadata before running the SELECT query against the table. Apparently it believes that it knows enough about the columns to construct a working query, even though (as it turns out) it is not necessarily the most efficient one.
I knew that SQLAlchemy's SQL Expression Language would retrieve the table metadata, so I tried a similar SELECT using
from sqlalchemy import MetaData, Table, select

# engine and value are assumed to be defined as in the question's example
metadata = MetaData()
my_table = Table('test', metadata, autoload=True, autoload_with=engine)

stmt = select([my_table.c.id, my_table.c.key])\
    .select_from(my_table)\
    .where(my_table.c.key == value)

cnxn = engine.connect()
items = cnxn.execute(stmt).fetchall()
and although SQLAlchemy did indeed retrieve the metadata using
SELECT [INFORMATION_SCHEMA].[columns].[table_schema],
[INFORMATION_SCHEMA].[columns].[table_name],
[INFORMATION_SCHEMA].[columns].[column_name],
[INFORMATION_SCHEMA].[columns].[is_nullable],
[INFORMATION_SCHEMA].[columns].[data_type],
[INFORMATION_SCHEMA].[columns].[ordinal_position],
[INFORMATION_SCHEMA].[columns].[character_maximum_length],
[INFORMATION_SCHEMA].[columns].[numeric_precision],
[INFORMATION_SCHEMA].[columns].[numeric_scale],
[INFORMATION_SCHEMA].[columns].[column_default],
[INFORMATION_SCHEMA].[columns].[collation_name]
FROM [INFORMATION_SCHEMA].[columns]
WHERE [INFORMATION_SCHEMA].[columns].[table_name] = Cast(
N'test' AS NVARCHAR(max))
AND [INFORMATION_SCHEMA].[columns].[table_schema] = Cast(
N'dbo' AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[columns].[ordinal_position]
a portion of whose output is
TABLE_SCHEMA TABLE_NAME COLUMN_NAME IS_NULLABLE DATA_TYPE ORDINAL_POSITION CHARACTER_MAXIMUM_LENGTH
------------ ---------- ----------- ----------- --------- ---------------- ------------------------
dbo test id NO int 1 NULL
dbo test key NO varchar 2 50
the resulting SELECT query still used an nvarchar literal
SELECT test.id, test.[key]
FROM test
WHERE test.[key] = N'record123456'
Finally, I did the same tests using pyodbc instead of pymssql and the results were essentially the same. I was curious if SQLAlchemy's dialect for pyodbc might take advantage of setinputsizes to specify the parameter types (i.e., pyodbc.SQL_VARCHAR instead of pyodbc.SQL_WVARCHAR), but apparently it does not.
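For reference, here is a sketch of what forcing the parameter type would look like in raw pyodbc (assuming pyodbc 4.0.24 or later, which added setinputsizes; the connection string is a placeholder):

import pyodbc

cnxn = pyodbc.connect('DSN=mydsn')  # placeholder connection string
crsr = cnxn.cursor()

# Declare the parameter as SQL_VARCHAR(50); without this, pyodbc sends
# Python str parameters as SQL_WVARCHAR (nvarchar) by default.
crsr.setinputsizes([(pyodbc.SQL_VARCHAR, 50, 0)])
crsr.execute("SELECT id, [key] FROM test WHERE [key] = ?", 'record123456')
row = crsr.fetchone()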
So, I'd say that for the time being your best bet is to continue encoding your string values into bytes that correspond to the character set of the varchar column you are querying (not utf-8). Of course, you can also dive into the source code for the SQLAlchemy dialect(s) and submit a PR to make SQLAlchemy better.
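For the SQL_Latin1_General_CP1_CI_AS collation in the question, the underlying code page is 1252, so (reusing the illustrative model from the question) that would look like:

# SQL_Latin1_General_CP1_CI_AS stores varchar data in Windows-1252
value = 'record123456'.encode('cp1252')
items = session.query(Test).filter(Test.key == value).all()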
Source: https://stackoverflow.com/questions/55098426/strings-used-in-query-always-sent-with-nvarchar-syntax-even-if-the-underlying-c