What .NET StringComparer is equivalent SQL's Latin1_General_CI_AS

前端 未结 4 764
借酒劲吻你
借酒劲吻你 2021-02-07 18:50

I am implementing a caching layer between my database and my C# code. The idea is to cache the results of certain DB queries based on the parameters to the query. The database i

4条回答
  •  星月不相逢
    2021-02-07 19:55

    I've recently faced with the same problem: I need an IEqualityComparer that behaves in SQL-like style. I've tried CollationInfo and its EqualityComparer. If your DB is always _AS (accent sensitive) then your solution will work, but in case if you change the collation that is AI or WI or whatever "insensitive" else the hashing will break.
    Why? If you decompile Microsoft.SqlServer.Management.SqlParser.dll and look inside you'll find out that CollationInfo internally uses CultureAwareComparer.GetHashCode (it's internal class of mscorlib.dll) and finally it does the following:

    public override int GetHashCode(string obj)
    {
      if (obj == null)
        throw new ArgumentNullException("obj");
      CompareOptions options = CompareOptions.None;
      if (this._ignoreCase)
        options |= CompareOptions.IgnoreCase;
      return this._compareInfo.GetHashCodeOfString(obj, options);
    }
    

    As you can see it can produce the same hashcode for "aa" and "AA", but not for "äå" and "aa" (which are the same, if you ignore diacritics (AI) in majority of cultures, so they should have the same hashcode). I don't know why the .NET API is limited by this, but you should understand where the problem can come from. To get the same hashcode for strings with diacritics you can do the following: create implementation of IEqualityComparer implementing the GetHashCode that will call appropriate CompareInfo's object's GetHashCodeOfString via reflection because this method is internal and can't be used directly. But calling it directly with correct CompareOptions will produce the desired result: See this example:

        static void Main(string[] args)
        {
            const string outputPath = "output.txt";
            const string latin1GeneralCiAiKsWs = "Latin1_General_100_CI_AI_KS_WS";
            using (FileStream fileStream = File.Open(outputPath, FileMode.Create, FileAccess.Write))
            {
                using (var streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
                {
                    string[] strings = { "aa", "AA", "äå", "ÄÅ" };
                    CompareInfo compareInfo = CultureInfo.GetCultureInfo(1033).CompareInfo;
                    MethodInfo GetHashCodeOfString = compareInfo.GetType()
                        .GetMethod("GetHashCodeOfString",
                        BindingFlags.Instance | BindingFlags.NonPublic,
                        null,
                        new[] { typeof(string), typeof(CompareOptions), typeof(bool), typeof(long) },
                        null);
    
                    Func correctHackGetHashCode = s => (int)GetHashCodeOfString.Invoke(compareInfo,
                        new object[] { s, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, false, 0L });
    
                    Func incorrectCollationInfoGetHashCode =
                        s => CollationInfo.GetCollationInfo(latin1GeneralCiAiKsWs).EqualityComparer.GetHashCode(s);
    
                    PrintHashCodes(latin1GeneralCiAiKsWs, incorrectCollationInfoGetHashCode, streamWriter, strings);
                    PrintHashCodes("----", correctHackGetHashCode, streamWriter, strings);
                }
            }
            Process.Start(outputPath);
        }
        private static void PrintHashCodes(string collation, Func getHashCode, TextWriter writer, params string[] strings)
        {
            writer.WriteLine(Environment.NewLine + "Used collation: {0}", collation + Environment.NewLine);
            foreach (string s in strings)
            {
                WriteStringHashcode(writer, s, getHashCode(s));
            }
        }
    

    The output is:

    Used collation: Latin1_General_100_CI_AI_KS_WS
    aa, hashcode: 2053722942
    AA, hashcode: 2053722942
    äå, hashcode: -266555795
    ÄÅ, hashcode: -266555795
    
    Used collation: ----
    aa, hashcode: 2053722942
    AA, hashcode: 2053722942
    äå, hashcode: 2053722942
    ÄÅ, hashcode: 2053722942
    

    I know it looks like the hack, but after inspecting decompiled .NET code I'm not sure if there any other option in case the generic functionality is needed. So be sure that you'll not fall into trap using this not fully correct API.
    UPDATE:
    I've also created the gist with potential implementation of "SQL-like comparer" using CollationInfo. Also there should be paid enough attention where to search for "string pitfalls" in your code base, so if the string comparison, hashcode, equality should be changed to "SQL collation-like" those places are 100% will be broken, so you'll have to find out and inspect all the places that can be broken.
    UPDATE #2:
    There is better and cleaner way to make GetHashCode() treat CompareOptions. There is the class SortKey that works correctly with CompareOptions and it can be retrieved using

    CompareInfo.GetSortKey(yourString, yourCompareOptions).GetHashCode()

    Here is the link to .NET source code and implementation.

    UPDATE #3:
    If you're on .NET Framework 4.7.1+ you should use new GlobalizationExtensions class as proposed by this recent answer.

提交回复
热议问题