Ignoring accented letters in string comparison

一向 2020-11-22 10:37

I need to compare 2 strings in C# and treat accented letters the same as non-accented letters. For example:

string s1 = "hello";
string s2 = "héllo";

s1         


        
6 Answers
  • 2020-11-22 11:04

    Try this overload of the String.Compare method:

    String.Compare Method (String, String, Boolean, CultureInfo)

    It returns an int based on a comparison that takes the given CultureInfo into account. The example on that page compares "change" and "dollar" in en-US and cs-CZ. In cs-CZ, "ch" is a single letter that sorts after "h", and therefore after "d".

    Example from the link:

    using System;
    using System.Globalization;
    
    class Sample {
        public static void Main() {
            String str1 = "change";
            String str2 = "dollar";
            String relation = null;

            relation = symbol( String.Compare(str1, str2, false, new CultureInfo("en-US")) );
            Console.WriteLine("For en-US: {0} {1} {2}", str1, relation, str2);

            relation = symbol( String.Compare(str1, str2, false, new CultureInfo("cs-CZ")) );
            Console.WriteLine("For cs-CZ: {0} {1} {2}", str1, relation, str2);
        }

        private static String symbol(int r) {
            String s = "=";
            if      (r < 0) s = "<";
            else if (r > 0) s = ">";
            return s;
        }
    }
    /*
    This example produces the following results.
    For en-US: change < dollar
    For cs-CZ: change > dollar
    */
    

    Therefore, for accented languages you will need to get the culture and then compare the strings based on it.

    http://msdn.microsoft.com/en-us/library/hyxc48dt.aspx
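
    As a rough sketch of how this overload relates to the question: passing a culture on its own keeps accents significant, and the accent is only ignored once CompareOptions.IgnoreNonSpace is supplied. The class and method names below are made up for illustration, and the choice of en-US is an assumption. On a runtime with full globalization data, both lines in Main should print True.

```csharp
using System;
using System.Globalization;

static class CultureCompareDemo
{
    // Culture-sensitive, case-sensitive comparison: accents still count as a difference.
    public static int ComparePlain(string a, string b, string cultureName) =>
        String.Compare(a, b, false, CultureInfo.GetCultureInfo(cultureName));

    // Same culture, but non-spacing marks (accents) are ignored.
    public static int CompareIgnoringAccents(string a, string b, string cultureName) =>
        String.Compare(a, b, CultureInfo.GetCultureInfo(cultureName), CompareOptions.IgnoreNonSpace);

    static void Main()
    {
        string s1 = "hello";
        string s2 = "h\u00E9llo"; // "héllo" written with a precomposed é

        Console.WriteLine(ComparePlain(s1, s2, "en-US") != 0);
        Console.WriteLine(CompareIgnoringAccents(s1, s2, "en-US") == 0);
    }
}
```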

  • 2020-11-22 11:07

    If you don't need to convert the string and you just want to check for equality, you can use:

    string s1 = "hello";
    string s2 = "héllo";
    
    if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace) == 0)
    {
        // both strings are equal
    }
    

    Or, if you want the comparison to be case-insensitive as well:

    string s1 = "HEllO";
    string s2 = "héLLo";
    
    if (String.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase) == 0)
    {
        // both strings are equal
    }
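
    The same options also work for substring searches through CompareInfo.IndexOf. Here is a minimal sketch that wraps both checks as extension methods. The class and method names are hypothetical, and falling back to CultureInfo.CurrentCulture when no culture is passed is an assumption that mirrors the snippets above.

```csharp
using System;
using System.Globalization;

static class AccentInsensitiveExtensions
{
    // Ignore both accents (non-spacing marks) and case.
    private const CompareOptions Options =
        CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    // True when the two strings are equal apart from accents and case.
    public static bool EqualsIgnoringAccents(this string a, string b, CultureInfo culture = null)
    {
        CompareInfo ci = (culture ?? CultureInfo.CurrentCulture).CompareInfo;
        return ci.Compare(a, b, Options) == 0;
    }

    // True when value occurs somewhere in source, ignoring accents and case.
    public static bool ContainsIgnoringAccents(this string source, string value, CultureInfo culture = null)
    {
        CompareInfo ci = (culture ?? CultureInfo.CurrentCulture).CompareInfo;
        return ci.IndexOf(source, value, Options) >= 0;
    }
}
```

    For example, "HEllO".EqualsIgnoringAccents("héLLo", CultureInfo.InvariantCulture) and "Crème Brûlée".ContainsIgnoringAccents("brulee", CultureInfo.InvariantCulture) should both be true when the runtime has full globalization data.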
    
  • 2020-11-22 11:11

    EDIT 2012-01-20: Oh boy! The solution was so much simpler and has been in the framework nearly forever. As pointed out by knightpfhor:

    string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);
    

    Here's a function that strips diacritics from a string:

    static string RemoveDiacritics(string text)
    {
      string formD = text.Normalize(NormalizationForm.FormD);
      StringBuilder sb = new StringBuilder();
    
      foreach (char ch in formD)
      {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
          sb.Append(ch);
        }
      }
    
      return sb.ToString().Normalize(NormalizationForm.FormC);
    }
    

    More details on MichKap's blog (RIP...).

    The principle is that FormD normalization turns 'é' into 2 successive chars: 'e' followed by a combining acute accent. The function then iterates through the chars and skips the diacritics (the non-spacing marks).

    "héllo" becomes "he<acute>llo", which in turn becomes "hello".

    Debug.Assert("hello"==RemoveDiacritics("héllo"));
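
    To see the decomposition for yourself, here is a small sketch that prints each char of the FormD form with its Unicode category. The class and method names are made up for illustration. It also shows a limitation of this approach: letters such as 'ø' have no canonical decomposition, so they pass through RemoveDiacritics unchanged.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DecompositionDemo
{
    // Describe each UTF-16 char of the FormD form as "U+XXXX:Category".
    public static string Describe(string text)
    {
        var sb = new StringBuilder();
        foreach (char ch in text.Normalize(NormalizationForm.FormD))
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)ch).ToString("X4"))
              .Append(':').Append(CharUnicodeInfo.GetUnicodeCategory(ch));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // U+0068:LowercaseLetter U+0065:LowercaseLetter U+0301:NonSpacingMark
        Console.WriteLine(Describe("h\u00E9"));

        // U+00F8:LowercaseLetter (ø is a single letter with no combining mark to strip)
        Console.WriteLine(Describe("\u00F8"));
    }
}
```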
    

    Note: Here's a more compact .NET 4+ friendly version of the same function (it needs using System.Linq;):

    static string RemoveDiacritics(string text)
    {
      return string.Concat( 
          text.Normalize(NormalizationForm.FormD)
          .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch)!=
                                        UnicodeCategory.NonSpacingMark)
        ).Normalize(NormalizationForm.FormC);
    }
    
  • 2020-11-22 11:11

    The following method CompareIgnoreAccents(...) works on your example data. Here is the article where I got my background information: http://www.codeproject.com/KB/cs/EncodingAccents.aspx

    private static bool CompareIgnoreAccents(string s1, string s2)
    {
        return string.Compare(
            RemoveAccents(s1), RemoveAccents(s2), StringComparison.InvariantCultureIgnoreCase) == 0;
    }
    
    private static string RemoveAccents(string s)
    {
        Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    
        return destEncoding.GetString(
            Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
    }
    

    I think an extension method would be better:

    public static string RemoveAccents(this string s)
    {
        Encoding destEncoding = Encoding.GetEncoding("iso-8859-8");
    
        return destEncoding.GetString(
            Encoding.Convert(Encoding.UTF8, destEncoding, Encoding.UTF8.GetBytes(s)));
    }
    

    Then the use would be this:

    if (string.Compare(s1.RemoveAccents(), s2.RemoveAccents(), true) == 0) {
       ...
    }
    
  • 2020-11-22 11:13

    I had to do something similar but with a StartsWith method. Here is a simple solution derived from @Serge - appTranslator.

    Here is an extension method:

        public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
        {
            if (str.Length >= value.Length)
                return string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
            else
                return false;            
        }
    

    And for one-liner freaks ;)

        public static bool StartsWith(this string str, string value, CultureInfo culture, CompareOptions options)
        {
            return str.Length >= value.Length && string.Compare(str.Substring(0, value.Length), value, culture, options) == 0;
        }
    

    An accent-insensitive and case-insensitive StartsWith can be called like this:

    value.ToString().StartsWith(str, CultureInfo.InvariantCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase)
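
    As an alternative sketch, the framework's own CompareInfo.IsPrefix accepts the same CompareOptions, so no Substring call is needed. That matters when the source string stores the accent as a separate combining char (for example "He\u0301llo"): the matching prefix is then longer than value, and Substring(0, value.Length) cuts it at the wrong place. The class and method names below are hypothetical.

```csharp
using System;
using System.Globalization;

static class PrefixDemo
{
    // Accent-insensitive and case-insensitive prefix test using the built-in IsPrefix.
    public static bool StartsWithIgnoringAccents(string source, string prefix, CultureInfo culture)
    {
        return culture.CompareInfo.IsPrefix(
            source, prefix, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase);
    }

    static void Main()
    {
        // Precomposed é: a single char in the source string.
        Console.WriteLine(StartsWithIgnoringAccents("H\u00E9llo world", "hel", CultureInfo.InvariantCulture));

        // Decomposed e + U+0301: the accent is a separate char in the source string.
        Console.WriteLine(StartsWithIgnoringAccents("He\u0301llo world", "hel", CultureInfo.InvariantCulture));
    }
}
```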
    
  • 2020-11-22 11:16

    A simpler way to remove accents (VB.NET):

        Dim source As String = "áéíóúç"
        Dim result As String
    
        Dim bytes As Byte() = Encoding.GetEncoding("Cyrillic").GetBytes(source)
        result = Encoding.ASCII.GetString(bytes)
    