I'm seeing some very strange sorting behaviour using CaseInsensitiveComparer.DefaultInvariant. Words that start with a leading hyphen "-" end up sorted as if the hyphen w
To sort the strings the way you need, create a comparer class that compares strings using the CompareInfo class. This class lets you specify various methods of comparison through CompareOptions; the one that best matches your needs is OrdinalIgnoreCase.
From MSDN:
Ignored Search Values
Comparison operations, such as those performed by the IndexOf or LastIndexOf methods, can yield unexpected results if the value to search for is ignored. The search value is ignored if it is an empty string (""), a character or string consisting of characters having code points that are not considered in the operation because of comparison options, or a value with code points that have no linguistic significance. If the search value for the IndexOf method is an empty string, for example, the return value is zero.
Note
When possible, the application should use string comparison methods that accept a CompareOptions value to specify the kind of comparison expected. As a general rule, user-facing comparisons are best served by the use of linguistic options (using the current culture), while security comparisons should specify Ordinal or OrdinalIgnoreCase.
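To make the distinction in that note concrete, here is a minimal sketch of the two kinds of comparison. The class and method names (ComparisonKinds, CompareForDisplay, EqualsForSecurity) are made up for illustration; only the string.Compare and string.Equals overloads come from the framework.

```csharp
using System;
using System.Globalization;

public static class ComparisonKinds
{
    // Linguistic comparison for user-facing text: current culture, case ignored.
    // The result depends on the culture's collation rules.
    public static int CompareForDisplay(string x, string y)
    {
        return string.Compare(x, y, CultureInfo.CurrentCulture, CompareOptions.IgnoreCase);
    }

    // Ordinal comparison for keys, identifiers, and security checks:
    // code units are compared after simple case folding, no linguistic rules.
    public static bool EqualsForSecurity(string x, string y)
    {
        return string.Equals(x, y, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // The sign of this result is culture-dependent, so no output is claimed here.
        Console.WriteLine(CompareForDisplay("-less", "a"));

        Console.WriteLine(EqualsForSecurity("ADMIN", "admin")); // True
    }
}
```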
I have modified your test case, and this one executes correctly:
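using System;
using System.Collections.Generic;
using System.Globalization;
using NUnit.Framework;

public class MyComparer : Comparer<string>
{
    private readonly CompareInfo compareInfo;

    public MyComparer()
    {
        // CompareInfo for the invariant culture (its Name is the empty string).
        compareInfo = CompareInfo.GetCompareInfo(CultureInfo.InvariantCulture.Name);
    }

    public override int Compare(string x, string y)
    {
        // OrdinalIgnoreCase compares code points after case folding,
        // so the hyphen is never ignored or given a special weight.
        return compareInfo.Compare(x, y, CompareOptions.OrdinalIgnoreCase);
    }
}

public class Class1
{
    [Test]
    public void TestMethod1()
    {
        var rg = new String[] {
            "x", "z", "y", "-less", ".net", "- more", "a", "b"
        };
        Array.Sort(rg, new MyComparer());
        Assert.AreEqual(
            "- more,-less,.net,a,b,x,y,z",
            String.Join(",", rg)
        );
    }
}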
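As a possible shortcut, the framework's built-in StringComparer.OrdinalIgnoreCase should give the same ordering without a custom class. This is a sketch rather than part of the original test; the BuiltInComparerDemo class and SortOrdinalIgnoreCase helper are hypothetical names.

```csharp
using System;

public static class BuiltInComparerDemo
{
    // Returns a sorted copy, leaving the caller's array untouched.
    public static string[] SortOrdinalIgnoreCase(string[] items)
    {
        var copy = (string[])items.Clone();
        Array.Sort(copy, StringComparer.OrdinalIgnoreCase);
        return copy;
    }

    public static void Main()
    {
        var rg = new[] { "x", "z", "y", "-less", ".net", "- more", "a", "b" };
        Console.WriteLine(string.Join(",", SortOrdinalIgnoreCase(rg)));
        // Prints: - more,-less,.net,a,b,x,y,z
    }
}
```

The space (U+0020) sorts before "L" (U+004C), which is why "- more" lands ahead of "-less".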
Comparison procedures use the CultureInfo.InvariantCulture to determine the sort order and casing rules. String comparisons might have different results depending on the culture. For more information on culture-specific comparisons, see the System.Globalization namespace and Encoding and Localization. From here.
The interesting part:
A word sort performs a culture-sensitive comparison of strings in which certain nonalphanumeric Unicode characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list. From here.
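One way to see the difference for yourself is to sort the same words under several CompareOptions values. This is only a sketch: the WordVersusStringSort class and CompareWith helper are hypothetical, and the linguistic results can differ between .NET versions and platforms, so no output is claimed for them.

```csharp
using System;
using System.Globalization;

public static class WordVersusStringSort
{
    // Compares two strings with the invariant culture's CompareInfo.
    public static int CompareWith(string x, string y, CompareOptions options)
    {
        return CultureInfo.InvariantCulture.CompareInfo.Compare(x, y, options);
    }

    public static void Main()
    {
        string[] words = { "coop", "co-op", "cop" };

        // None       = default word sort (hyphen may get a very small weight)
        // StringSort = hyphen sorts like other nonalphanumeric symbols
        // Ordinal    = raw code point order
        var modes = new[] { CompareOptions.None, CompareOptions.StringSort, CompareOptions.Ordinal };

        foreach (var options in modes)
        {
            var copy = (string[])words.Clone();
            Array.Sort(copy, (a, b) => CompareWith(a, b, options));
            Console.WriteLine(options + ": " + string.Join(", ", copy));
        }
    }
}
```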
Sort order is dependent on the culture, so you can't assume characters will sort in ASCII order.
http://msdn.microsoft.com/en-us/library/a7zyyk0c.aspx
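In your example, "H" (U+0048) is before an en dash (U+2013), so "hello" will appear before "-less" if the leading character really is an en dash. "." (U+002E) is before both, so ".net" appears first. Note that the ASCII hyphen-minus is U+002D, which in a purely ordinal comparison sorts before both "." and the letters.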
My guess would be that a dash immediately before a letter is being ignored for purposes of sorting. When you sort a list of words, you'd like "inter-nation" and "international" to be next to each other, wouldn't you? A dash by itself, on the other hand, is considered significant.