All 3 options are case and accent sensitive, and support Unicode. According to the documentation:
NVarchar sorts and compares data based on the \"dictionari
All 3 options are case and accent sensitive, and support Unicode.
NVARCHAR
is a datatype (like INT
, DATETIME
, etc.) and not an option. It stores Unicode characters in the UCS-2 / UTF-16 (Little Endian) encoding. UCS-2 and UTF-16 are the identical code points for the U+0000 through U+FFFF (decimal values 0 - 65535) range. UTF-16 handles code points U+10000 and above (known as Supplementary Characters), all of which are defined as pairs of code points (known as Surrogate Pairs) that exist in the UCS-2 range. Since the byte sequences are identical between the two, the only difference is in the handling of the data. Meaning, built-in functions do not know how to interpret Supplementary Characters when using Collations that do not end in _SC
, whereas they do work correctly for the full UTF-16 range when using Collations that do end in _SC
. The _SC
Collations were added in SQL Server 2012, but you can still store and retrieve Supplementary Characters in prior versions; it is only the built-in functions that do not behave as expected when operating on Supplementary Characters.
More directly:
NVARCHAR
, being a datatype, is not inherently case or accent (or any other sensitivity) sensitive or insensitive. The exact behavior depends on the collation set for the column, or the database's default collation, or the COLLATE
clause, depending on the context of the expression.Collations, while literally being about how characters sort and compare to each other, in SQL Server also imply the Locale / LCID (which determines the cultural rules that override the default handling of those comparisons) and the Code Page used for VARCHAR
data.
Non-binary collations are considered "dictionary" sorting / comparisons because they take into account the rules of the particular culture specified by the Collation (specifically the associated LCID). On the other hand, binary collations do not deal with any culture-specific rules and only sort and compare based on the numeric value of each 2-byte sequence. For this reason binary collations are much faster because they don't need to apply a large list of rules, but they also have no way to know that single two-byte Code Point that is a u
with an accent is not the same as 2 two-byte sequences which are a u
and a separate accent that will render on screen the same as the single two-byte code point, and will compare as being equal when using a non-binary collation.
The difference between _BIN
and _BIN2
is sorting accuracy, not performance. The older _BIN
collations do a simplistic byte-by-byte sorting and comparison (after the first character, which is seen as a code point and not two bytes, thus it sorts correctly) whereas the newer _BIN2
collations (starting in SQL Server 2005) compare each Code "Unit" (Supplementary Characters are made up of two Code Units, and _BIN2
collations see each Code Unit individually instead of seeing the combination of them as a Code Point). There is a difference in sort order between these two approaches mainly due to SQL Server being "Little Endian" which stores bytes (for a single entity: UTF-16 code unit, INT
value, BIGINT
value, etc) in reverse order. Hence, code point U+0206 will actually sort after U+0402 when using a _BIN
collation:
SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN;
/*
ID Thing ThingBytes
2 aЂ 0x61000204
1 aȆ 0x61000602 <-- U+0206, stored as 0x06 then 0x02, should sort first
*/
SELECT *, CONVERT(VARBINARY(20), tmp.[Thing]) AS [ThingBytes]
FROM (VALUES (1, N'a' + NCHAR(0x0206)), (2, N'a' + NCHAR(0x0402))) tmp ([ID], [Thing])
ORDER BY tmp.[Thing] COLLATE Latin1_General_100_BIN2;
/*
ID Thing ThingBytes
1 aȆ 0x61000602
2 aЂ 0x61000204
*/
For more details and examples of this distinction, please see: Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2).
Also, all binary collations sort and compare in exactly the same manner when it comes to Unicode / NVARCHAR
data. Code Points are numerical values and there are no linguistic / cultural variations to consider when comparing them. Hence the only purpose in having more than a single, global "BINARY" Collation is the need to still specify the Code Page to use for VARCHAR
data.
Suppose I want to store private names in different alphabets, which option may I use?
If you were using VARCHAR
fields, then the Collation specific (regardless of binary or non-binary) would determine which characters are available since that is 8-bit Extended ASCII which typically has a range of 256 different characters (unless using a Double-Byte Character Set, in which case it can handle many more, but those are still mostly of a single culture / alphabet). If using NVARCHAR
to store the data, since that is Unicode it has a single character set comprised of all characters from all languages, plus lots of other stuff.
So choosing NVARCHAR
takes care of the problem of being able to hold the proper characters of names coming from various languages. HOWEVER, you still need to pick a particular cultures dictionary rules in order to sort in a manner that each particular culture expects. This is a problem because Collations cannot be set dynamically. So pick the one that is used the most. Binary collations will not help you here, and in fact would go against what you are trying to do. They are, however, quite handy when you need to distinguish between characters that would otherwise equate, such as in this case: SQL server filtering CJK punctuation characters (here on S.O.).
Another related scenario in which I have used a _BIN2
collation was detecting case changes in URLs. Some parts of a URL are case-insensitive, such as the hostname / domain name. But, in the QueryString, the values being passed in are potentially sensitive. If you compare URL values in a case-insensitive operation, then http://domain.tld/page.ext?var1=val
would equate to http://domain.tld/page.ext?var1=VAL
, and those values should not be assumed to be the same. Using a case-sensitive Collation would also typically work, but I use Latin1_General_100_BIN2
because it's faster (no linguistic rules) and would not ignore a change of ü
to u
+ combining diaeresis (which renders as ü
).
I have more explanations of Collations spread across the following answers (so won't duplicate here as most of them contain several examples):
And these are on DBA.StackExchange:
For more info on working with Collations, Encodings, Unicode, etc, please visit: Collations Info
nvarchar
is a data type, and the "BIN" or "BIN2" collations are just that - collation sequences. They are two different things.
You use an nvarchar
column to store unicode character data:
nchar and nvarchar (Transact-SQL)
String data types that are either fixed-length, nchar, or variable-length, nvarchar, Unicode data and use the UNICODE UCS-2 character set.
https://msdn.microsoft.com/en-GB/library/ms186939(v=sql.105).aspx
An nvarchar
column will have an associated collation sequence that defines how the characters sort and compare. This can also be set for the whole database.
COLLATE (Transact-SQL)
Is a clause that can be applied to a database definition or a column definition to define the collation, or to a character string expression to apply a collation cast.
https://msdn.microsoft.com/en-us/library/ms184391(v=sql.105).aspx
So, when working with character data in SQL server, you always use both a character data-type (nvarchar, varchar, nchar or char) along with an appropriate collation according to your needs for case-sensitivity, accent-sensitivity etc.
For example, in my work I normally use the "Latin1_General_CI_AI" collation. This is suitable for latin character sets, and provides case-insensitive and accent-insensitive matching for queries. That means that the following strings are all considered to be equal:
This is ideal for systems where there may be words containing accented characters (as above), but you can't be sure you users will enter the accents when searching for something.
If you only wanted case-insensitivity then you would use a "CI_AS" (accent sensitive) collation instead.
The "_BIN" collations are for binary comparisons that treat every distinct character as different, and wouldn't be used for general text comparisons.
Edit for updated question:
Provided that you always use nvarchar
(as opposed to varchar
) columns then you always have support for all unicode code points, no matter what collation is used.
There is no practical difference in your example query, as it is only a simple insert and select. Also bear in mind that your first "word1" column will be using the database or server's default collation - there's always a collation in use!
Where the differences will occur is if you use criteria against your nvarchar columns, or sort by them. This is what collations are for - they define which characters should be treated as equivalent for comparisons and sorting.
I can't say anything about Cyrillic, but in the case of Latin characters, using the "Latin1_General_CI_AI" collation, then characters such as A
a
á
â
etc are all equivalent - the case and the accent are ignored.
Imagine if you have the string Aaáâ
stored in your "word1" column, then the query SELECT * FROM words2 WHERE word1 = 'aaaa'
will return your row.
If you use a "_BIN" collation then all these characters are treated as distinct, and the query above would not return a row. I can't think of a situation where you'd want to use a "_BIN" collation when working with textual data. Edit 2: Actually I can - storing password hashes would be a good place to use a binary collation, so that comparisons are exact. That's about all.
I hope this makes it clearer.