Question
I'm noticing some odd behavior in the SQL generated for queries against string fields in MS SQL.
Server version: SQL Server 2014 12.0.5000.0
Collation: SQL_Latin1_General_CP1_CI_AS
Python version: 3.7
Our database has a mix of NVARCHAR (mostly newer) and VARCHAR (mostly older) fields. We are using SQLAlchemy to connect our Python application to the database, and even though we specify that a column is of type String (as opposed to Unicode), the executed SQL always comes out in NVARCHAR syntax (for example, N'foo').
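For concreteness, here is a minimal sketch of the kind of mapping and query involved (the Test model is illustrative; the connection string is a placeholder):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()

class Test(Base):
    __tablename__ = 'test'
    id = Column(Integer, primary_key=True)
    key = Column(String(50))  # VARCHAR(50) on the server, declared as String

engine = create_engine('mssql+pymssql://user:pass@host/db')  # placeholder
session = sessionmaker(bind=engine)()

# Despite the String (not Unicode) declaration, the emitted SQL is
#   SELECT ... WHERE test.[key] = N'record123456'
items = session.query(Test).filter(Test.key == 'record123456').all()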
This ends up creating some obvious problems, as a simple index lookup on a multi-million row table turns into a giant string re-encoding operation.
The workaround I discovered is to pass in bytestrings (a la s.encode("utf-8")) instead of strs, but this is incredibly error-prone and hackish. I expected SQLAlchemy to handle this automatically since I told it that I'm querying against a String column and not a Unicode column.
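Continuing the illustrative sketch above, the workaround looks like this:

# Passing bytes instead of str makes the driver render a plain varchar
# literal: SELECT ... WHERE test.[key] = 'record123456' (no N prefix)
items = session.query(Test).filter(
    Test.key == 'record123456'.encode('utf-8')
).all()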
If this is supposed to happen automatically, then maybe it's because it doesn't know the database collation? If so, how would I go about setting this?
Finally, as another point of reference, we're using pymssql. I am aware, through previous experience before we were using SQLAlchemy, that pymssql does the same thing (it assumes unicode strings are NVARCHAR while bytestrings are not). Code here. As far as I can tell, SQLAlchemy just passes this off down the line. This behavior is a bit surprising to me since SQLAlchemy knows the column types and the type of connection/driver it's working with.
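To show that driver-level behavior in isolation, here is a raw-pymssql sketch (connection parameters are placeholders):

import pymssql

conn = pymssql.connect(server='host', user='user',
                       password='pass', database='db')
cur = conn.cursor()

# str parameter: pymssql interpolates it as N'record123456' (nvarchar)
cur.execute("SELECT id, [key] FROM test WHERE [key] = %s",
            ('record123456',))

# bytes parameter: pymssql interpolates it as 'record123456' (varchar)
cur.execute("SELECT id, [key] FROM test WHERE [key] = %s",
            ('record123456'.encode('utf-8'),))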
I'm not afraid to get my hands dirty, so if anyone happens to know where this could be reasonably patched, I'd be happy to contribute. My current investigation seems to indicate something to do with dialects and/or query/statement compilation.
I've uploaded a minimal example project to GitHub.
EDIT 2019-03-18: Updated with new information based on investigation.
EDIT 2019-03-23: Added GitHub repo with minimal example.
Answer 1:
I was able to reproduce the issue. Your MCVE was very helpful.
It was interesting to see that, for your ORM example, SQL Profiler showed no evidence that SQLAlchemy was retrieving the column metadata before running the SELECT query against the table. Apparently it believes that it knows enough about the columns to construct a working query, even though (as it turns out) it is not necessarily the most efficient one.
I knew that SQLAlchemy's SQL Expression Language would retrieve the table metadata, so I tried a similar SELECT using
from sqlalchemy import MetaData, Table, select

# engine and value are assumed to be defined as in the question's example
metadata = MetaData()
my_table = Table('test', metadata, autoload=True, autoload_with=engine)

stmt = select([my_table.c.id, my_table.c.key])\
    .select_from(my_table)\
    .where(my_table.c.key == value)

cnxn = engine.connect()
items = cnxn.execute(stmt).fetchall()
and although SQLAlchemy did indeed retrieve the metadata using
SELECT [INFORMATION_SCHEMA].[columns].[table_schema],
[INFORMATION_SCHEMA].[columns].[table_name],
[INFORMATION_SCHEMA].[columns].[column_name],
[INFORMATION_SCHEMA].[columns].[is_nullable],
[INFORMATION_SCHEMA].[columns].[data_type],
[INFORMATION_SCHEMA].[columns].[ordinal_position],
[INFORMATION_SCHEMA].[columns].[character_maximum_length],
[INFORMATION_SCHEMA].[columns].[numeric_precision],
[INFORMATION_SCHEMA].[columns].[numeric_scale],
[INFORMATION_SCHEMA].[columns].[column_default],
[INFORMATION_SCHEMA].[columns].[collation_name]
FROM [INFORMATION_SCHEMA].[columns]
WHERE [INFORMATION_SCHEMA].[columns].[table_name] = Cast(
N'test' AS NVARCHAR(max))
AND [INFORMATION_SCHEMA].[columns].[table_schema] = Cast(
N'dbo' AS NVARCHAR(max))
ORDER BY [INFORMATION_SCHEMA].[columns].[ordinal_position]
a portion of whose output is
TABLE_SCHEMA TABLE_NAME COLUMN_NAME IS_NULLABLE DATA_TYPE ORDINAL_POSITION CHARACTER_MAXIMUM_LENGTH
------------ ---------- ----------- ----------- --------- ---------------- ------------------------
dbo test id NO int 1 NULL
dbo test key NO varchar 2 50
the resulting SELECT query still used an nvarchar literal
SELECT test.id, test.[key]
FROM test
WHERE test.[key] = N'record123456'
Finally, I did the same tests using pyodbc instead of pymssql and the results were essentially the same. I was curious if SQLAlchemy's dialect for pyodbc might take advantage of setinputsizes to specify the parameter types (i.e., pyodbc.SQL_VARCHAR instead of pyodbc.SQL_WVARCHAR), but apparently it does not.
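For reference, here is a sketch of what forcing the parameter type would look like in raw pyodbc (assuming pyodbc 4.0.24 or later, which added setinputsizes; the connection string is a placeholder):

import pyodbc

cnxn = pyodbc.connect('DSN=mydsn')  # placeholder connection string
crsr = cnxn.cursor()

# Declare the parameter as SQL_VARCHAR(50); without this, pyodbc sends
# Python str parameters as SQL_WVARCHAR (nvarchar) by default.
crsr.setinputsizes([(pyodbc.SQL_VARCHAR, 50, 0)])
crsr.execute("SELECT id, [key] FROM test WHERE [key] = ?", 'record123456')
row = crsr.fetchone()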
So, I'd say that for the time being your best bet is to continue encoding your string values into bytes that correspond to the character set of the varchar column you are querying (not utf-8). Of course, you can also dive into the source code for the SQLAlchemy dialect(s) and submit a PR to make SQLAlchemy better.
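For the SQL_Latin1_General_CP1_CI_AS collation in the question, the underlying code page is 1252, so (reusing the illustrative model from the question) that would look like:

# SQL_Latin1_General_CP1_CI_AS stores varchar data in Windows-1252
value = 'record123456'.encode('cp1252')
items = session.query(Test).filter(Test.key == value).all()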
Source: https://stackoverflow.com/questions/55098426/strings-used-in-query-always-sent-with-nvarchar-syntax-even-if-the-underlying-c