How to compare Unicode characters that “look alike”?

情歌与酒 2020-11-27 10:42

I ran into a surprising issue.

I loaded a text file in my application, and I have some logic that compares a value containing µ.

And I realized that even if

10 answers
  • 2020-11-27 11:31

    Because they really are different symbols, even though they look the same. The first is the actual Greek letter and has character code 956 (0x3BC); the second is the micro sign and has character code 181 (0xB5).

    References:

    • Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
    • Unicode Character 'MICRO SIGN' (U+00B5)

    So if you want to compare them and need them to be equal, you have to handle it manually, or replace one character with the other before comparing. Or use the following code:

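    using System;
    using System.Globalization;
    using System.Text;
    
    class Program
    {
        public static void Main()
        {
            var s1 = "μ"; // U+03BC GREEK SMALL LETTER MU
            var s2 = "µ"; // U+00B5 MICRO SIGN
    
            Console.WriteLine(s1.Equals(s2));  // False
            Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // True
        }
    
        // The compatibility normalization (FormKC) is what makes the two
        // characters equal: it maps MICRO SIGN to GREEK SMALL LETTER MU.
        // The loop additionally drops any combining marks that remain.
        static string RemoveDiacritics(string text)
        {
            var normalizedString = text.Normalize(NormalizationForm.FormKC);
            var stringBuilder = new StringBuilder();
    
            foreach (var c in normalizedString)
            {
                var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(c);
                }
            }
    
            return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
        }
    }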
    

    And the Demo
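
    The other option mentioned above, replacing one character with the other before comparing, might look roughly like this. It is only a sketch: the class and method names (MicroSignFolding, FoldMicroSign, EqualsIgnoringMicroSign) are made up for illustration, and it deliberately handles just this one pair of characters.

```csharp
using System;

static class MicroSignFolding
{
    // Hypothetical helper: maps MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC)
    // and leaves every other character untouched.
    public static string FoldMicroSign(string text)
    {
        return text.Replace('\u00B5', '\u03BC');
    }

    // Ordinal comparison after folding, so culture settings play no role.
    public static bool EqualsIgnoringMicroSign(string a, string b)
    {
        return string.Equals(FoldMicroSign(a), FoldMicroSign(b), StringComparison.Ordinal);
    }
}
```

    With this, MicroSignFolding.EqualsIgnoringMicroSign("5 \u00B5m", "5 \u03BCm") returns true, while all other characters are still compared exactly.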

  • 2020-11-27 11:32

    You can draw both characters with the same font style and size using the DrawString method. Once the two bitmaps containing the symbols have been generated, you can compare them pixel by pixel.

    The advantage of this method is that you can match not only identical characters but also similar-looking ones (within a chosen tolerance).

  • 2020-11-27 11:33

    In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

    For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

    Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

    This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

    So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

    using System;
    using System.Text;
    
    class Program
    {
        static void Main(string[] args)
        {
            char first = 'μ';
            char second = 'µ';
    
            // Technically you only need to normalize U+00B5 to obtain U+03BC, but
            // if you're unsure which character is which, you can safely normalize both
            string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
            string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);
    
            Console.WriteLine(first.Equals(second));                     // False
            Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
        }
    }
    

    For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
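
    To see how much the choice of normalization form matters for whole strings, here is a minimal sketch of a comparison helper. NormalizedComparison and EqualsNormalized are hypothetical names introduced for this example; only String.Normalize and NormalizationForm come from the framework.

```csharp
using System;
using System.Text;

static class NormalizedComparison
{
    // Hypothetical helper: normalizes both strings to the given form,
    // then compares the results ordinally (code unit by code unit).
    public static bool EqualsNormalized(string a, string b, NormalizationForm form)
    {
        return string.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);
    }
}
```

    EqualsNormalized("\u00B5", "\u03BC", NormalizationForm.FormC) returns false, because U+00B5 only has a compatibility decomposition, while the same call with FormKC or FormKD returns true. Keep in mind that the compatibility forms fold other characters too (for example, U+00B2 SUPERSCRIPT TWO becomes a plain "2"), which may or may not be what your comparison should do.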

  • 2020-11-27 11:39

    The two characters have different character codes:

    Console.WriteLine((int)'μ');  //956
    Console.WriteLine((int)'µ');  //181
    

    Here is how the two characters are defined:

    Display     Friendly Code   Decimal Code    Hex Code    Description
    ====================================================================
    μ           &mu;            &#956;          &#x3BC;     Lowercase Mu
    µ           &micro;         &#181;          &#xB5;      micro sign Mu
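
    When the suspicious character is buried in text loaded from a file, it can help to list the code points of the whole string. The following is a rough sketch; CodePointDump and Describe are hypothetical names, and it assumes the input is well-formed UTF-16 (char.ConvertToUtf32 throws on unpaired surrogates).

```csharp
using System;
using System.Collections.Generic;

static class CodePointDump
{
    // Hypothetical helper: lists each code point of a string as "U+XXXX".
    public static List<string> Describe(string text)
    {
        var result = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            // ConvertToUtf32 combines a surrogate pair into a single code point.
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                i++; // skip the low surrogate that was just consumed
            }
            result.Add("U+" + codePoint.ToString("X4"));
        }
        return result;
    }
}
```

    For example, string.Join(" ", CodePointDump.Describe("\u03BC\u00B5")) yields "U+03BC U+00B5", which makes the difference between the two look-alikes visible at a glance.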
    
