How to compare Unicode characters that “look alike”?

情歌与酒 2020-11-27 10:42

I've run into a surprising issue.

I loaded a text file in my application, and I have some logic that compares values containing µ.

And I realized that even if

10 answers
  • 2020-11-27 11:13

    You ask "how to compare them" but you don't tell us what you want to do.

    There are at least two main ways to compare them:

    Either you compare them directly as they are, in which case they are different,

    or you use Unicode compatibility normalization, if you need a comparison that finds them to match.

    There could be a problem, though: Unicode compatibility normalization will also make many other characters compare equal. If you want only these two characters to be treated as alike, you should roll your own normalization or comparison function.
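
    A minimal sketch of the "roll your own" option, assuming the only pair you want to unify is U+00B5 MICRO SIGN and U+03BC GREEK SMALL LETTER MU. The names `MuComparer`, `FoldMicroSign` and `EqualsIgnoringMicroSign` are hypothetical, not part of any library:

```csharp
using System;

public static class MuComparer
{
    // Map U+00B5 MICRO SIGN to U+03BC GREEK SMALL LETTER MU; leave everything else untouched.
    public static string FoldMicroSign(string s)
    {
        return s.Replace('\u00B5', '\u03BC');
    }

    // Ordinal comparison after folding, so no other characters are unified.
    public static bool EqualsIgnoringMicroSign(string a, string b)
    {
        return string.Equals(FoldMicroSign(a), FoldMicroSign(b), StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(EqualsIgnoringMicroSign("5 \u00B5m", "5 \u03BCm")); // True
        Console.WriteLine(EqualsIgnoringMicroSign("\uFB03", "ffi"));           // False: the ligature is left alone
    }
}
```

    Unlike compatibility normalization, this leaves ligatures, full-width forms and every other compatibility character as they are.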

    For a more specific solution we need to know your specific problem. What is the context under which you came across this problem?

  • 2020-11-27 11:14

    For the specific example of μ (mu) and µ (micro sign), the latter has a compatibility decomposition to the former, so you can normalize the string to FormKC or FormKD to convert the micro signs to mus.
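
    As a sketch of that approach, the comparison below normalizes both sides to FormKC before an ordinal comparison. `CompatibilityCompare` and `EqualsNfkc` are hypothetical names chosen for this example:

```csharp
using System;
using System.Text;

public static class CompatibilityCompare
{
    // Normalize both strings to NFKC, then compare code unit by code unit.
    public static bool EqualsNfkc(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        string micro = "\u00B5"; // MICRO SIGN
        string mu = "\u03BC";    // GREEK SMALL LETTER MU
        Console.WriteLine(micro == mu);           // False
        Console.WriteLine(EqualsNfkc(micro, mu)); // True
    }
}
```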

    However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks. If necessary, you could parse this file and build a table for “visual normalization” of strings.
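
    A rough sketch of such a table, assuming the `source ; target ; type # comment` line layout of confusables.txt with space-separated hexadecimal code points. The class `Confusables` and its methods are hypothetical, the two sample lines are written in the file's format for illustration, and the per-code-point `Fold` is a simplification rather than the full skeleton algorithm described by Unicode:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Confusables
{
    // Parse lines of the form "source ; target ; type # comment".
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var table = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (string raw in lines)
        {
            string line = raw.Trim('\uFEFF', ' ', '\t');
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash);
            string[] fields = line.Split(';');
            if (fields.Length < 2) continue; // blank or comment-only line
            table[FromHex(fields[0])] = FromHex(fields[1]);
        }
        return table;
    }

    // Convert "0041 0301" style fields into the string they denote.
    private static string FromHex(string field)
    {
        var sb = new StringBuilder();
        foreach (string hex in field.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
            sb.Append(char.ConvertFromUtf32(Convert.ToInt32(hex, 16)));
        return sb.ToString();
    }

    // Replace each code point that has a table entry with its target.
    public static string Fold(string s, Dictionary<string, string> table)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            // Take one code point (one or two UTF-16 code units) at a time.
            int length = char.IsSurrogatePair(s, i) ? 2 : 1;
            string codePoint = s.Substring(i, length);
            string target;
            sb.Append(table.TryGetValue(codePoint, out target) ? target : codePoint);
            i += length - 1;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var sample = new[]
        {
            "0391 ;\t0041 ;\tMA\t# GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A",
            "0410 ;\t0041 ;\tMA\t# CYRILLIC CAPITAL LETTER A -> LATIN CAPITAL LETTER A",
        };
        var table = Parse(sample);
        Console.WriteLine(Fold("\u0391", table) == Fold("\u0410", table)); // True: both fold to Latin "A"
    }
}
```

    In real use the lines would come from the downloaded file instead of the inline sample.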

  • 2020-11-27 11:15

    EDIT: After the merge of this question with "How to compare 'μ' and 'µ' in C#".
    Original answer posted:

        "μ".ToUpper().Equals("µ".ToUpper()); // This always returns true.
    

    EDIT: After reading the comments: yes, the method above is not a good idea, because it may give wrong results for other kinds of input. Instead we should normalize using full compatibility decomposition, as described in the wiki. (Thanks to the answer posted by BoltClock.)

        using System;
        using System.Globalization;
        using System.Text;

        public static class MuNormalizationDemo
        {
            private const string GreekSmallLetterMu = "\u03BC";
            private const string MicroSign = "\u00B5";

            public static void Main()
            {
                // U+00B5 MICRO SIGN followed by U+03BC GREEK SMALL LETTER MU.
                string mus = "\u00B5\u03BC";

                foreach (char c in mus)
                {
                    string original = c.ToString();
                    if (original.Equals(GreekSmallLetterMu))
                        Console.WriteLine("INFORMATION ABOUT GREEK_SMALL_LETTER_MU");
                    else if (original.Equals(MicroSign))
                        Console.WriteLine("INFORMATION ABOUT MICRO_SIGN");

                    ShowHexaDecimal(original);
                    Console.WriteLine("Unicode character category " + CharUnicodeInfo.GetUnicodeCategory(c));

                    // Show the UTF-16 code units produced by each normalization form.
                    ShowNormalized("Form C", original, NormalizationForm.FormC);
                    ShowNormalized("Form D", original, NormalizationForm.FormD);
                    ShowNormalized("Form KC", original, NormalizationForm.FormKC);
                    ShowNormalized("Form KD", original, NormalizationForm.FormKD);
                    Console.WriteLine("_______________________________________________________________");
                }
                Console.ReadLine();
            }

            private static void ShowNormalized(string label, string text, NormalizationForm form)
            {
                Console.Write(label + " Normalized: ");
                ShowHexaDecimal(text.Normalize(form));
            }

            // Prints each UTF-16 code unit of the string as four hex digits.
            private static void ShowHexaDecimal(string unicodeString)
            {
                Console.Write("Hexa-Decimal Characters of " + unicodeString + "  are ");
                foreach (char c in unicodeString)
                {
                    Console.Write("{0:X4} ", (int)c);
                }
                Console.WriteLine();
            }
        }

    Output

    INFORMATION ABOUT MICRO_SIGN
    Hexa-Decimal Characters of µ  are 00B5
    Unicode character category LowercaseLetter
    Form C Normalized: Hexa-Decimal Characters of µ  are 00B5
    Form D Normalized: Hexa-Decimal Characters of µ  are 00B5
    Form KC Normalized: Hexa-Decimal Characters of μ  are 03BC
    Form KD Normalized: Hexa-Decimal Characters of μ  are 03BC
    _______________________________________________________________
    INFORMATION ABOUT GREEK_SMALL_LETTER_MU
    Hexa-Decimal Characters of μ  are 03BC
    Unicode character category LowercaseLetter
    Form C Normalized: Hexa-Decimal Characters of μ  are 03BC
    Form D Normalized: Hexa-Decimal Characters of μ  are 03BC
    Form KC Normalized: Hexa-Decimal Characters of μ  are 03BC
    Form KD Normalized: Hexa-Decimal Characters of μ  are 03BC
    _______________________________________________________________
    

    While reading about Unicode_equivalence, I found:

    The choice of equivalence criteria can affect search results. For instance some typographic ligatures like U+FB03 (ffi), ..... so a search for U+0066 (f) as substring would succeed in an NFKC normalization of U+FB03 but not in NFC normalization of U+FB03.

    So, to compare for equivalence, we should normally use FormKC (NFKC normalization) or FormKD (NFKD normalization).
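
    The ligature case from the quote can be checked with a short sketch. `LigatureSearch` and `ContainsAfter` are hypothetical names used only for this illustration:

```csharp
using System;
using System.Text;

public static class LigatureSearch
{
    // Normalize the text with the given form, then do an ordinal substring search.
    public static bool ContainsAfter(string text, string value, NormalizationForm form)
    {
        return text.Normalize(form).IndexOf(value, StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        string ligature = "\uFB03"; // LATIN SMALL LIGATURE FFI
        Console.WriteLine(ContainsAfter(ligature, "f", NormalizationForm.FormC));  // False
        Console.WriteLine(ContainsAfter(ligature, "f", NormalizationForm.FormKC)); // True
    }
}
```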
    I was a little curious to learn more about all the Unicode characters, so I wrote a sample that iterates over every character value in UTF-16, and I got some results I want to discuss:

    • Information about characters whose FormC and FormD normalized values were not equivalent
      Total: 12,118
      Character (int value): 192-197, 199-207, 209-214, 217-221, 224-253, ..... 44032-55203
    • Information about characters whose FormKC and FormKD normalized values were not equivalent
      Total: 12,245
      Character (int value): 192-197, 199-207, 209-214, 217-221, 224-228, ..... 44032-55203, 64420-64421, 64432-64433, 64490-64507, 64512-64516, 64612-64617, 64663-64667, 64735-64736, 65153-65164, 65269-65274
    • For all the characters whose FormC and FormD normalized values were not equivalent, their FormKC and FormKD normalized values were also not equivalent, except for these characters
      Characters: 901 '΅', 8129 '῁', 8141 '῍', 8142 '῎', 8143 '῏', 8157 '῝', 8158 '῞', 8159 '῟', 8173 '῭', 8174 '΅'
    • Additional characters whose FormKC and FormKD normalized values were not equivalent, but whose FormC and FormD normalized values were equivalent
      Total: 119
      Characters: 452 'DŽ' 453 'Dž' 454 'dž' 12814 '㈎' 12815 '㈏' 12816 '㈐' 12817 '㈑' 12818 '㈒' 12819 '㈓' 12820 '㈔' 12821 '㈕', 12822 '㈖' 12823 '㈗' 12824 '㈘' 12825 '㈙' 12826 '㈚' 12827 '㈛' 12828 '㈜' 12829 '㈝' 12830 '㈞' 12910 '㉮' 12911 '㉯' 12912 '㉰' 12913 '㉱' 12914 '㉲' 12915 '㉳' 12916 '㉴' 12917 '㉵' 12918 '㉶' 12919 '㉷' 12920 '㉸' 12921 '㉹' 12922 '㉺' 12923 '㉻' 12924 '㉼' 12925 '㉽' 12926 '㉾' 13056 '㌀' 13058 '㌂' 13060 '㌄' 13063 '㌇' 13070 '㌎' 13071 '㌏' 13072 '㌐' 13073 '㌑' 13075 '㌓' 13077 '㌕' 13080 '㌘' 13081 '㌙' 13082 '㌚' 13086 '㌞' 13089 '㌡' 13092 '㌤' 13093 '㌥' 13094 '㌦' 13099 '㌫' 13100 '㌬' 13101 '㌭' 13102 '㌮' 13103 '㌯' 13104 '㌰' 13105 '㌱' 13106 '㌲' 13108 '㌴' 13111 '㌷' 13112 '㌸' 13114 '㌺' 13115 '㌻' 13116 '㌼' 13117 '㌽' 13118 '㌾' 13120 '㍀' 13130 '㍊' 13131 '㍋' 13132 '㍌' 13134 '㍎' 13139 '㍓' 13140 '㍔' 13142 '㍖' .......... ﺋ' 65164 'ﺌ' 65269 'ﻵ' 65270 'ﻶ' 65271 'ﻷ' 65272 'ﻸ' 65273 'ﻹ' 65274'
    • Some character values cannot be normalized (the surrogate range and some noncharacters); they throw an ArgumentException if you try
      Total: 2081
      Characters (int value): 55296-57343, 64976-65007, 65534
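
    A scan of this kind could look roughly like the sketch below. It is not the original sample, and `NormalizationScan` and `CountDifferences` are hypothetical names. The totals it prints depend on the Unicode data shipped with the runtime, so they may not match the numbers above:

```csharp
using System;
using System.Text;

public static class NormalizationScan
{
    // Returns how many UTF-16 code units normalize differently under the two forms.
    // "failed" receives the number of code units that Normalize rejects.
    public static int CountDifferences(NormalizationForm first, NormalizationForm second, out int failed)
    {
        int different = 0;
        failed = 0;
        for (int i = char.MinValue; i <= char.MaxValue; i++)
        {
            string s = ((char)i).ToString();
            try
            {
                if (!string.Equals(s.Normalize(first), s.Normalize(second), StringComparison.Ordinal))
                    different++;
            }
            catch (ArgumentException)
            {
                // For example a lone surrogate, which is not valid Unicode text on its own.
                failed++;
            }
        }
        return different;
    }

    public static void Main()
    {
        int failed;
        int cd = CountDifferences(NormalizationForm.FormC, NormalizationForm.FormD, out failed);
        Console.WriteLine("FormC vs FormD differ for " + cd + " code units; " + failed + " could not be normalized");

        int kckd = CountDifferences(NormalizationForm.FormKC, NormalizationForm.FormKD, out failed);
        Console.WriteLine("FormKC vs FormKD differ for " + kckd + " code units");
    }
}
```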

    These links can be really helpful for understanding the rules that govern Unicode equivalence:

    1. Unicode_equivalence
    2. Unicode_compatibility_characters
  • 2020-11-27 11:22

    If I wanted to be pedantic, I would say that your question doesn't make sense, but since we are approaching Christmas and the birds are singing, I'll proceed with this.

    First off, the two entities you are trying to compare are glyphs. A glyph is part of a set of glyphs provided by what is usually known as a "font", the thing that usually comes in a ttf, otf, or whatever file format you are using.

    The glyphs are a representation of a given symbol, and since that representation depends on a specific set, you can't just expect to have two similar, or even "better", identical symbols. The phrase doesn't make sense if you consider the context; you should at least specify which font or set of glyphs you are considering when you formulate a question like this.

    What is usually used to solve a problem like the one you are encountering is OCR, essentially software that recognizes and compares glyphs. I don't know whether C# provides OCR by default, but it's generally a really bad idea unless you really need OCR and know what to do with it.

    You could end up interpreting a physics book as an ancient Greek book, not to mention that OCR is generally expensive in terms of resources.

    There is a reason why those characters are localized the way they are; just don't do that.

  • 2020-11-27 11:23

    Most likely, there are two different character codes that make (visibly) the same character. While technically not equal, they look equal. Have a look at the character table and see whether there are multiple instances of that character. Or print out the character code of the two chars in your code.
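
    Printing the character codes can be as simple as the sketch below, which lists the UTF-16 code units of a string. `CodePointDump` and `Describe` are hypothetical names:

```csharp
using System;
using System.Linq;

public static class CodePointDump
{
    // Formats every UTF-16 code unit of the string as U+XXXX.
    public static string Describe(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        Console.WriteLine(Describe("\u00B5")); // U+00B5
        Console.WriteLine(Describe("\u03BC")); // U+03BC
    }
}
```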

  • 2020-11-27 11:24

    Look up both characters in a Unicode database and see the difference.

    One is the Greek small letter mu (μ, U+03BC) and the other is the micro sign (µ, U+00B5).

    Name            : MICRO SIGN
    Block           : Latin-1 Supplement
    Category        : Letter, Lowercase [Ll]
    Combine         : 0
    BIDI            : Left-to-Right [L]
    Decomposition   : <compat> GREEK SMALL LETTER MU (U+03BC)
    Mirror          : N
    Index entries   : MICRO SIGN
    Upper case      : U+039C
    Title case      : U+039C
    Version         : Unicode 1.1.0 (June, 1993)
    

    Name            : GREEK SMALL LETTER MU
    Block           : Greek and Coptic
    Category        : Letter, Lowercase [Ll]
    Combine         : 0
    BIDI            : Left-to-Right [L]
    Mirror          : N
    Upper case      : U+039C
    Title case      : U+039C
    See Also        : micro sign U+00B5
    Version         : Unicode 1.1.0 (June, 1993)
    