Why does string.Compare seem to handle accented characters inconsistently?

孤城傲影 2020-12-10 02:52

If I execute the following statement:

string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)

The result is '-1', indicating

3 Answers
  • 2020-12-10 03:22

    There is a tie-breaking algorithm at work; see the Unicode Collation Algorithm report at http://unicode.org/reports/tr10/, which explains:

    To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, for example, the most important feature is the base character: such as the difference between an A and a B. Accent differences are typically ignored, if there are any differences in the base letters. Case differences (uppercase versus lowercase), are typically ignored, if there are any differences in the base or accents. Punctuation is variable. In some situations a punctuation character is treated like a base character. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level, whereby if there are no other differences at all in the string, the (normalized) code point order is used.

    So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".

    Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the comparison falls through to the lower-priority tie-breaking levels, where the accent difference decides the order.
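
    The levels described above can be made visible with CompareOptions. The following is only a minimal sketch: the class and method names (CollationLevels, BaseLettersOnly, WithAccents) are made up for illustration, and it assumes a runtime with culture-aware collation available (NLS or ICU, not invariant-globalization mode). BaseLettersOnly ignores accents and case, so it looks at base letters alone; WithAccents is intended to mirror the string.Compare(x, y, true, CultureInfo.InvariantCulture) call from the question.

    ```csharp
    using System;
    using System.Globalization;

    // Hypothetical helper class, not part of any library.
    static class CollationLevels
    {
        static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

        // Base letters only: accents (non-spacing marks) and case are ignored.
        public static int BaseLettersOnly(string x, string y) =>
            Invariant.Compare(x, y, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);

        // Case is ignored, but accents still count once the base letters tie.
        public static int WithAccents(string x, string y) =>
            Invariant.Compare(x, y, CompareOptions.IgnoreCase);

        static void Main()
        {
            foreach (var (x, y) in new[] { ("mun", "mün"), ("Muntelier", "München") })
            {
                Console.WriteLine("{0}; {1}; base letters only: {2}; with accents: {3}",
                    x, y, BaseLettersOnly(x, y), WithAccents(x, y));
            }
        }
    }
    ```

    If the explanation above holds, "mun" and "mün" should tie on base letters and differ only once accents are considered, while "Muntelier" and "München" should already differ on base letters ("t" versus "c").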

  • 2020-12-10 03:28

    It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.

    Here's some sample code to demonstrate:

    using System;
    using System.Globalization;
    
    class Test
    {
        static void Main()
        {
            Compare("mun", "mün");
            Compare("muna", "münb");
            Compare("munb", "müna");
        }
    
        static void Compare(string x, string y)
        {
            // Case-insensitive, culture-sensitive comparison using the invariant culture.
            int result = string.Compare(x, y, true,
                                        CultureInfo.InvariantCulture);
    
            Console.WriteLine("{0}; {1}; {2}", x, y, result);
        }
    }
    

    (I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)

    Results:

    mun; mün; -1
    muna; münb; -1
    munb; müna; 1
    

    I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.

    As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?

  • 2020-12-10 03:42

    As I understand it, this is still somewhat consistent. When comparing with CultureInfo.InvariantCulture, the umlaut character ü is treated like the unaccented character u.

    Since the strings in your first example are obviously not equal, the result is not 0 but -1 (which seems to be a default value). In the second example, Muntelier goes last because t follows c in the alphabet.

    I couldn't find any clear documentation in MSDN explaining these rules, but I found that

    string.Compare("mun", "mün", CultureInfo.InvariantCulture,  
        CompareOptions.StringSort);
    

    and

    string.Compare("Muntelier, Schweiz", "München, Deutschland", 
        CultureInfo.InvariantCulture, CompareOptions.StringSort);
    

    give the desired result.

    Anyway, I think you'd be better off basing your sorting on a specific culture, such as the current user's culture (if possible).
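
    As a rough sketch of that last suggestion, a list can be sorted with a comparer built for a given culture. CultureSortSketch and SortFor are hypothetical names chosen for this example; StringComparer.Create and List<T>.Sort are the standard .NET calls. The resulting order depends on the culture passed in and on the collation data of the runtime, so treat this as an illustration rather than a definitive recipe.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Globalization;

    // Hypothetical helper class, not part of any library.
    static class CultureSortSketch
    {
        // Returns a sorted copy of items using the culture's case-insensitive rules.
        public static List<string> SortFor(CultureInfo culture, IEnumerable<string> items)
        {
            var sorted = new List<string>(items);
            sorted.Sort(StringComparer.Create(culture, ignoreCase: true));
            return sorted;
        }

        static void Main()
        {
            var places = new[] { "Muntelier, Schweiz", "München, Deutschland", "mün", "mun" };

            // CurrentCulture stands in for "the current user's culture".
            foreach (string s in SortFor(CultureInfo.CurrentCulture, places))
                Console.WriteLine(s);
        }
    }
    ```

    Passing the culture in as a parameter keeps the choice explicit, so the same helper can be called with CultureInfo.InvariantCulture when a culture-independent order is wanted.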
