How to compare Unicode characters that “look alike”?

前端未结

关注

 10  985

情歌与酒

I fall into a surprising issue.

I loaded a text file in my application and I have some logic which compares the value having µ.

And I realized that even if

相关标签:

10条回答

走了就别回头了

2020-11-27 11:31

Because it is really different symbols even they look the same, first is the actual letter and has char code = 956 (0x3BC) and the second is the micro sign and has 181 (0xB5).

References:

Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
Unicode Character 'MICRO SIGN' (U+00B5)

So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:

public void Main()
{
    var s1 = "μ";
    var s2 = "µ";

    Console.WriteLine(s1.Equals(s2));  // false
    Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true 
}

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormKC);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

And the Demo

0 讨论(0)

囚心锁ツ

2020-11-27 11:32

It's possible to draw both of chars with the same font style and size with DrawString method. After two bitmaps with symbols has been generated, it's possible to compare them pixel by pixel.

Advantage of this method is that you can compare not only absolute equal charcters, but similar too (with definite tolerance).

0 讨论(0)
发布评论:

提交评论
- 加载中...
粉色の甜心

2020-11-27 11:33
In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
```
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        char first = 'μ';
        char second = 'µ';

        // Technically you only need to normalize U+00B5 to obtain U+03BC, but
        // if you're unsure which character is which, you can safely normalize both
        string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
        string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

        Console.WriteLine(first.Equals(second));                     // False
        Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
    }
}
```
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
0 讨论(0)
发布评论:

提交评论
- 加载中...

旧时难觅i

2020-11-27 11:39

They both have different character codes: Refer this for more details

Console.WriteLine((int)'μ');  //956
Console.WriteLine((int)'µ');  //181

Where, 1st one is:

Display     Friendly Code   Decimal Code    Hex Code    Description
====================================================================
μ           &mu;            &#956;          &#x3BC;     Lowercase Mu
µ           &micro;         &#181;          &#xB5;      micro sign Mu

0 讨论(0)

上一页 1 2