https://stackoverflow.com/a/361059/14731 discusses the differences between \"word sort\" and \"string sort\".
How does one query programmatically when an SQL
Let's start with the definition of these types of sorts as given by Microsoft (taken from the "Remarks" section of the CompareOptions Enumeration MSDN page):
The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.
Unicode is culturally-sensitive and weighted, and the XML
and N
-prefixed types are Unicode, so they could be saying that data in Unicode types use "word sort" while data in the non-Unicode types uses "string sort". Ordinal refers to the BIN
and BIN2
collations, though the BIN
collations aren't 100% ordinal due to how they treat the first character.
But let's see what SQL Server says it is doing. Run the following:
DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
INSERT INTO @SampleData (ANSI, UTF16) VALUES
('a-b-c', N'a-b-c'),
('ac', N'ac'),
('aba', N'aba'),
('a-b', N'a-b'),
('ab', N'ab');
SELECT sd.ANSI AS [ANSI-Latin1_General_100_CI_AS]
FROM @SampleData sd
ORDER BY sd.ANSI COLLATE Latin1_General_100_CI_AS ASC;
SELECT sd.UTF16 AS [UTF16-Latin1_General_100_CI_AS]
FROM @SampleData sd
ORDER BY sd.UTF16 COLLATE Latin1_General_100_CI_AS ASC;
SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
FROM @SampleData sd
ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
FROM @SampleData sd
ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
Results:
ANSI-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac
UTF16-Latin1_General_100_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac
ANSI-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
a-b
a-b-c
ab
aba
ac
UTF16-SQL_Latin1_General_CP1_CI_AS
-------------------------------------
ab
a-b
aba
a-b-c
ac
Hmm. Only the SQL_
collation in combination with the VARCHAR
field appears to be doing what could be considered "string sort". It makes sense that the SQL_
collation in combination with the NVARCHAR
field would do "word sort" it is the same Unicode handling as the non-SQL_
collations. But is there something besides being a SQL Server collation (i.e. starting with SQL_
) that determines "string" vs "word" sort? Let's look at the only properties of the collations that we can extract:
SELECT N'Latin1_General_100_CI_AS' AS [CollationName],
COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS [CodePage],
COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'LCID') AS [LCID],
COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'ComparisonStyle') AS [ComparisonStyle]
UNION ALL
SELECT N'SQL_Latin1_General_CP1_CI_AS' AS [CollationName],
COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage'),
COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID'),
COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle');
Results:
CollationName CodePage LCID ComparisonStyle
---------------------------- -------- ---- ---------------
Latin1_General_100_CI_AS 1252 1033 196609
SQL_Latin1_General_CP1_CI_AS 1252 1033 196609
So, there are no discernible differences there. Which seems to leave us with this:
String sort is done when:
SQL_
, ANDFor more information on Unicode sorting in general, check out the following resources:
SQL_
collators, all other data is sorted according to "Unicode Collation" rules.According to https://support.microsoft.com/en-us/kb/322112
SQL Server 2000 supports two types of collations:
- SQL collations
- Windows collations
[...]
For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data.
[...]
A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules.
I interpret this as meaning that:
SQL_
collators are "SQL collations"SQL_
collators, all other data is sorted according to "Windows collations".So, let's dig into "Windows collations".
For Unicode data types, data comparisons are based on the Unicode code points.
// Sorting Flags.
//
// WORD Sort: culturally correct sort
// hyphen and apostrophe are special cased
// example: “coop” and “co-op” will sort together in a list
//
// co_op <——- underscore (symbol)
// coat
// comb
// coop
// co-op <——- hyphen (punctuation)
// cork
// went
// were
// we’re <——- apostrophe (punctuation)
//
//
// STRING Sort: hyphen and apostrophe will sort with all other symbols
//
// co-op <——- hyphen (punctuation)
// co_op <——- underscore (symbol)
// coat
// comb
// coop
// cork
// we’re <——- apostrophe (punctuation)
// went
// were
[...] all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.