How to find out whether collation uses word sort or string sort?

前端 未结 2 1310
耶瑟儿~
耶瑟儿~ 2021-01-19 05:47

https://stackoverflow.com/a/361059/14731 discusses the differences between \"word sort\" and \"string sort\".

How does one query programmatically when an SQL

相关标签:
2条回答
  • 2021-01-19 06:22

    Let's start with the definition of these types of sorts as given by Microsoft (taken from the "Remarks" section of the CompareOptions Enumeration MSDN page):

    The .NET Framework uses three distinct ways of sorting: word sort, string sort, and ordinal sort. Word sort performs a culture-sensitive comparison of strings. Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. String sort is similar to word sort, except that there are no special cases. Therefore, all nonalphanumeric symbols come before all alphanumeric characters. Ordinal sort compares strings based on the Unicode values of each element of the string.

    Unicode is culturally-sensitive and weighted, and the XML and N-prefixed types are Unicode, so they could be saying that data in Unicode types use "word sort" while data in the non-Unicode types uses "string sort". Ordinal refers to the BIN and BIN2 collations, though the BIN collations aren't 100% ordinal due to how they treat the first character.

    But let's see what SQL Server says it is doing. Run the following:

    DECLARE @SampleData TABLE (ANSI VARCHAR(50), UTF16 NVARCHAR(50));
    INSERT INTO @SampleData (ANSI, UTF16) VALUES 
        ('a-b-c', N'a-b-c'),
        ('ac', N'ac'),
        ('aba', N'aba'),
        ('a-b', N'a-b'),
        ('ab', N'ab');
    
    SELECT sd.ANSI AS [ANSI-Latin1_General_100_CI_AS]
    FROM   @SampleData sd
    ORDER BY sd.ANSI COLLATE Latin1_General_100_CI_AS ASC;
    
    SELECT sd.UTF16 AS [UTF16-Latin1_General_100_CI_AS]
    FROM   @SampleData sd
    ORDER BY sd.UTF16 COLLATE Latin1_General_100_CI_AS ASC;
    
    SELECT sd.ANSI AS [ANSI-SQL_Latin1_General_CP1_CI_AS]
    FROM   @SampleData sd
    ORDER BY sd.ANSI COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
    
    SELECT sd.UTF16 AS [UTF16-SQL_Latin1_General_CP1_CI_AS]
    FROM   @SampleData sd
    ORDER BY sd.UTF16 COLLATE SQL_Latin1_General_CP1_CI_AS ASC;
    

    Results:

    ANSI-Latin1_General_100_CI_AS
    -------------------------------------
    ab
    a-b
    aba
    a-b-c
    ac
    
    UTF16-Latin1_General_100_CI_AS
    -------------------------------------
    ab
    a-b
    aba
    a-b-c
    ac
    
    ANSI-SQL_Latin1_General_CP1_CI_AS
    -------------------------------------
    a-b
    a-b-c
    ab
    aba
    ac
    
    UTF16-SQL_Latin1_General_CP1_CI_AS
    -------------------------------------
    ab
    a-b
    aba
    a-b-c
    ac
    

    Hmm. Only the SQL_ collation in combination with the VARCHAR field appears to be doing what could be considered "string sort". It makes sense that the SQL_ collation in combination with the NVARCHAR field would do "word sort" it is the same Unicode handling as the non-SQL_ collations. But is there something besides being a SQL Server collation (i.e. starting with SQL_ ) that determines "string" vs "word" sort? Let's look at the only properties of the collations that we can extract:

    SELECT N'Latin1_General_100_CI_AS' AS [CollationName],
           COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'CodePage') AS [CodePage],
           COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'LCID') AS [LCID],
          COLLATIONPROPERTY('Latin1_General_100_CI_AS', 'ComparisonStyle') AS [ComparisonStyle]
    UNION ALL
    SELECT N'SQL_Latin1_General_CP1_CI_AS' AS [CollationName],
           COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage'),
           COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'LCID'),
           COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'ComparisonStyle');
    

    Results:

    CollationName                  CodePage   LCID    ComparisonStyle
    ----------------------------   --------   ----    ---------------
    Latin1_General_100_CI_AS       1252       1033    196609
    SQL_Latin1_General_CP1_CI_AS   1252       1033    196609
    

    So, there are no discernible differences there. Which seems to leave us with this:

    String sort is done when:

    • the collation name starts with SQL_, AND
    • the data (field, variable, string literal) is non-Unicode (i.e. CHAR / VARCHAR / TEXT)

    For more information on Unicode sorting in general, check out the following resources:

    • Unicode Collation Charts (per language) - shows the characters per each language, showing how they are sorted
    • Unicode Collation Algorithm (UCA) - some "light" (ha!) reading on the algorithm(s) used to sort Unicode data - this is the default algorithm that is in effect unless overridden by a particular locale's rules.
    • Collation Guidelines - how to read the locale-specific override charts
    0 讨论(0)
  • 2021-01-19 06:35
    • srutzky's excellent answer reveals that, with the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Unicode Collation" rules.
    • Confusingly, Microsoft does not use the Unicode standard's sorting rules.
    • According to https://support.microsoft.com/en-us/kb/322112

      SQL Server 2000 supports two types of collations:

      • SQL collations
      • Windows collations

      [...]

      For a Windows collation, a comparison of non-Unicode data is implemented by using the same algorithm as Unicode data.

      [...]

      A SQL collation's rules for sorting non-Unicode data are incompatible with any sort routine that is provided by the Microsoft Windows operating system; however, the sorting of Unicode data is compatible with a particular version of the Windows sorting rules.

    • I interpret this as meaning that:

      • SQL_ collators are "SQL collations"
      • All other collators are "Windows collators".
      • With the exception of non-Unicode types processed by SQL_ collators, all other data is sorted according to "Windows collations".

    So, let's dig into "Windows collations".

    • According to https://msdn.microsoft.com/en-us/library/ms143515(v=sql.105).aspx

      For Unicode data types, data comparisons are based on the Unicode code points.

    • winnls.h contains a brief overview of "word sort":
    //  Sorting Flags.
    //
    //    WORD Sort:    culturally correct sort
    //                  hyphen and apostrophe are special cased
    //                  example: “coop” and “co-op” will sort together in a list
    //
    //                        co_op     <——-  underscore (symbol)
    //                        coat
    //                        comb
    //                        coop
    //                        co-op     <——-  hyphen (punctuation)
    //                        cork
    //                        went
    //                        were
    //                        we’re     <——-  apostrophe (punctuation)
    //
    //
    //    STRING Sort:  hyphen and apostrophe will sort with all other symbols
    //
    //                        co-op     <——-  hyphen (punctuation)
    //                        co_op     <——-  underscore (symbol)
    //                        coat
    //                        comb
    //                        coop
    //                        cork
    //                        we’re     <——-  apostrophe (punctuation)
    //                        went
    //                        were
    
    • And finally, according to https://msdn.microsoft.com/en-us/library/windows/desktop/dd318144(v=vs.85).aspx

      [...] all punctuation marks and other nonalphanumeric characters, except for the hyphen and the apostrophe, come before any alphanumeric character. The hyphen and the apostrophe are treated differently from the other nonalphanumeric characters to ensure that words such as "coop" and "co-op" stay together in a sorted list.

    0 讨论(0)
提交回复
热议问题