What .NET StringComparer is equivalent SQL's Latin1_General_CI_AS

前端 未结 4 741
借酒劲吻你
借酒劲吻你 2021-02-07 18:50

I am implementing a caching layer between my database and my C# code. The idea is to cache the results of certain DB queries based on the parameters to the query. The database i

相关标签:
4条回答
  • 2021-02-07 19:44

    The following is much simpler:

    System.Globalization.CultureInfo.GetCultureInfo(1033)
                  .CompareInfo.GetStringComparer(CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth)
    

    It comes from https://docs.microsoft.com/en-us/dotnet/api/system.globalization.globalizationextensions?view=netframework-4.8

    It computes the hashcode correctly given the options given. You'll still have to trim trailing spaces manually, as they as discarded by ANSI sql but not in .net

    Here is a wrapper that trims spaces.

    using System.Collections.Generic;
    using System.Globalization;
    
    namespace Wish.Core
    {
        public class SqlStringComparer : IEqualityComparer<string>
        {
            public static IEqualityComparer<string> Instance { get; }
    
            private static IEqualityComparer<string> _internalComparer =
                CultureInfo.GetCultureInfo(1033)
                           .CompareInfo
                           .GetStringComparer(CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth);
    
    
    
            private SqlStringComparer()
            {
            }
    
            public bool Equals(string x, string y)
            {
                //ANSI sql doesn't consider trailing spaces but .Net does
                return _internalComparer.Equals(x?.TrimEnd(), y?.TrimEnd());
            }
    
            public int GetHashCode(string obj)
            {
                return _internalComparer.GetHashCode(obj?.TrimEnd());
            }
    
            static SqlStringComparer()
            {
                Instance = new SqlStringComparer();
            }
        }
    }
    
    
    0 讨论(0)
  • 2021-02-07 19:49

    SQL Server's Server.GetStringComparer may be of some use.

    0 讨论(0)
  • 2021-02-07 19:51

    Take a look at the CollationInfo class. It is located in an assembly called Microsoft.SqlServer.Management.SqlParser.dll although I am not totally sure where to get this. There is a static list of Collations (names) and a static method GetCollationInfo (by name).

    Each CollationInfo has a Comparer. It is not exactly the same as a StringComparer but has similar functionality.

    EDIT: Microsoft.SqlServer.Management.SqlParser.dll is a part of the Shared Management Objects (SMO) package. This feature can be downloaded for SQL Server 2008 R2 here:

    http://www.microsoft.com/download/en/details.aspx?id=16978#SMO

    EDIT: CollationInfo does have a property named EqualityComparer which is an IEqualityComparer<string>.

    0 讨论(0)
  • 2021-02-07 19:55

    I've recently faced with the same problem: I need an IEqualityComparer<string> that behaves in SQL-like style. I've tried CollationInfo and its EqualityComparer. If your DB is always _AS (accent sensitive) then your solution will work, but in case if you change the collation that is AI or WI or whatever "insensitive" else the hashing will break.
    Why? If you decompile Microsoft.SqlServer.Management.SqlParser.dll and look inside you'll find out that CollationInfo internally uses CultureAwareComparer.GetHashCode (it's internal class of mscorlib.dll) and finally it does the following:

    public override int GetHashCode(string obj)
    {
      if (obj == null)
        throw new ArgumentNullException("obj");
      CompareOptions options = CompareOptions.None;
      if (this._ignoreCase)
        options |= CompareOptions.IgnoreCase;
      return this._compareInfo.GetHashCodeOfString(obj, options);
    }
    

    As you can see it can produce the same hashcode for "aa" and "AA", but not for "äå" and "aa" (which are the same, if you ignore diacritics (AI) in majority of cultures, so they should have the same hashcode). I don't know why the .NET API is limited by this, but you should understand where the problem can come from. To get the same hashcode for strings with diacritics you can do the following: create implementation of IEqualityComparer<T> implementing the GetHashCode that will call appropriate CompareInfo's object's GetHashCodeOfString via reflection because this method is internal and can't be used directly. But calling it directly with correct CompareOptions will produce the desired result: See this example:

        static void Main(string[] args)
        {
            const string outputPath = "output.txt";
            const string latin1GeneralCiAiKsWs = "Latin1_General_100_CI_AI_KS_WS";
            using (FileStream fileStream = File.Open(outputPath, FileMode.Create, FileAccess.Write))
            {
                using (var streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
                {
                    string[] strings = { "aa", "AA", "äå", "ÄÅ" };
                    CompareInfo compareInfo = CultureInfo.GetCultureInfo(1033).CompareInfo;
                    MethodInfo GetHashCodeOfString = compareInfo.GetType()
                        .GetMethod("GetHashCodeOfString",
                        BindingFlags.Instance | BindingFlags.NonPublic,
                        null,
                        new[] { typeof(string), typeof(CompareOptions), typeof(bool), typeof(long) },
                        null);
    
                    Func<string, int> correctHackGetHashCode = s => (int)GetHashCodeOfString.Invoke(compareInfo,
                        new object[] { s, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace, false, 0L });
    
                    Func<string, int> incorrectCollationInfoGetHashCode =
                        s => CollationInfo.GetCollationInfo(latin1GeneralCiAiKsWs).EqualityComparer.GetHashCode(s);
    
                    PrintHashCodes(latin1GeneralCiAiKsWs, incorrectCollationInfoGetHashCode, streamWriter, strings);
                    PrintHashCodes("----", correctHackGetHashCode, streamWriter, strings);
                }
            }
            Process.Start(outputPath);
        }
        private static void PrintHashCodes(string collation, Func<string, int> getHashCode, TextWriter writer, params string[] strings)
        {
            writer.WriteLine(Environment.NewLine + "Used collation: {0}", collation + Environment.NewLine);
            foreach (string s in strings)
            {
                WriteStringHashcode(writer, s, getHashCode(s));
            }
        }
    

    The output is:

    Used collation: Latin1_General_100_CI_AI_KS_WS
    aa, hashcode: 2053722942
    AA, hashcode: 2053722942
    äå, hashcode: -266555795
    ÄÅ, hashcode: -266555795
    
    Used collation: ----
    aa, hashcode: 2053722942
    AA, hashcode: 2053722942
    äå, hashcode: 2053722942
    ÄÅ, hashcode: 2053722942
    

    I know it looks like the hack, but after inspecting decompiled .NET code I'm not sure if there any other option in case the generic functionality is needed. So be sure that you'll not fall into trap using this not fully correct API.
    UPDATE:
    I've also created the gist with potential implementation of "SQL-like comparer" using CollationInfo. Also there should be paid enough attention where to search for "string pitfalls" in your code base, so if the string comparison, hashcode, equality should be changed to "SQL collation-like" those places are 100% will be broken, so you'll have to find out and inspect all the places that can be broken.
    UPDATE #2:
    There is better and cleaner way to make GetHashCode() treat CompareOptions. There is the class SortKey that works correctly with CompareOptions and it can be retrieved using

    CompareInfo.GetSortKey(yourString, yourCompareOptions).GetHashCode()

    Here is the link to .NET source code and implementation.

    UPDATE #3:
    If you're on .NET Framework 4.7.1+ you should use new GlobalizationExtensions class as proposed by this recent answer.

    0 讨论(0)
提交回复
热议问题