Storing UTF-16/Unicode data in SQL Server

问题

According to this, SQL Server 2K5 uses UCS-2 internally. It can store UTF-16 data in UCS-2 (with appropriate data types, nchar etc), however if there is a supplementary character this is stored as 2 UCS-2 characters.

This brings the obvious issues with the string functions, namely that what is one character is treated as 2 by SQL Server.

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, any suggestions on best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2K5.

回答1:

SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.

回答2:

The string functions work fine with unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two characters. The only ones to watch for are len() and datalength(), which return different values when using unicode. They return the correct values of course - len() returns the length in characters, and datalength() returns the length in bytes. They just happen to be different because of the two-byte characters.

So, as long as you use the proper functions in your code, everything should work transparently.

EDIT: Just double-checked Books Online, unicode data has worked seemlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set due to lack of support for parsing surrogates outside of plane 0 (or, in other words, SQL Server's string functions only recognize up to 2 bytes per character.) SQL Server will store and return the data correctly, however any string function that relies on character counts will not return the expected values. The most common way to bypass this seems to be either processing the string outside SQL Server, or else using the CLR integration to add Unicode aware string processing functions.

回答3:

something to add, that I just learned the hard way:

if you use an "n" field in oracle (im running 9i), and access it via the .net oracleclient, it seems that only parameterized sql will work... the N'string' unicode prefix doesnt seem to do the trick if you have some inline sql.

and by "work", I mean: it will lose any characters not supported by the base charset. So in my instances, english chars work fine, cyrillic turns into question marks/garbage.

this is a fuller discussion on the subject: http://forums.oracle.com/forums/thread.jspa?threadID=376847

Wonder if the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.

来源：https://stackoverflow.com/questions/805259/storing-utf-16-unicode-data-in-sql-server

标签

sql-server

unicode

utf-16

ucs2