Why doesn't Đ get flattened to D when Removing Accents/Diacritics

你的背包 2021-01-04 01:37

I'm using this method to remove accents from my strings:

static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationFo

5 Answers
  • 2021-01-04 01:48

    This should work:

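        // Requires: using System.Globalization; using System.Text;
        // Strips only the combining marks that FormD produces; letters with no
        // decomposition mapping (such as đ) pass through unchanged.
        private static String RemoveDiacritics(string text)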
        {
            String normalized = text.Normalize(NormalizationForm.FormD);
            StringBuilder sb = new StringBuilder();
    
            for (int i = 0; i < normalized.Length; i++)
            {
                Char c = normalized[i];
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                    sb.Append(c);
            }
    
            return sb.ToString();
        }
    
  • 2021-01-04 02:00

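    I have to admit that I'm not sure why this works, but it sure seems to:

    // On .NET Core / .NET 5+, the "Cyrillic" code page is likely unavailable until you call
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) (System.Text.Encoding.CodePages).
    var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
    var noDiacritics = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));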
    

    => "aoaaaaalccceeeeiiddnnooooruuuuyt"

  • 2021-01-04 02:01

    string.Normalize(NormalizationForm) is an easy way to remove 'real' diacritics (Wiki), but many letters you may want to convert are not affected by it.

    I had similar problems with Ð & ð (letter Eth), đ, Æ & æ. To convert them into ANSI (Latin), use Unicode conversion instead!

        private static char[] ConvertUnicodeStringToSpecificEncoding(string input, int resultEncodingCode)
        {
            System.Text.Encoding unicodeEncoding = System.Text.Encoding.Unicode;
            System.Text.Encoding specificEncoding = System.Text.Encoding.GetEncoding(resultEncodingCode);
    
            byte[] convertedBytes = System.Text.Encoding.Convert(unicodeEncoding, specificEncoding, unicodeEncoding.GetBytes(input));
            char[] convertedChars = new char[specificEncoding.GetCharCount(convertedBytes, 0, convertedBytes.Length)];
            specificEncoding.GetChars(convertedBytes, 0, convertedBytes.Length, convertedChars, 0);
            return convertedChars;
        }
    

    Call this method with multiple encodings on the same string to create an intersection of the letters you want to have left.

    List of encodings: https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8

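    My solution looks like this:

        // Requires: using System.Linq; using System.Text;
        // RemoveDiacritics is an extension method, so these members must live in a static class.
        // Encoding types (int codes): https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8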
        private static readonly char[] charactersToSkip = new char[] { 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü' };
        private static readonly char[] specialCharsToSkip = new char[] { '^', '´', '`', '°', '!', '\'', '§', '$', '%', '&', '/', '(', ')', '=', '{', '[', ']', '}', '\\', '+', '-' };
        private static readonly char[] ambiguousCharsToSkip = new char[] { '?' };   // Chars which might be a result of encoding-conversion and have to be skipped beforehand.
        private static readonly int[] encodingsToRemoveDiacritics = new int[]
        {
            852,    // 852  ibm852  Central European (DOS)
            850,    // 850  ibm850  Western European (DOS)
            860,    // 860  IBM860  Portuguese (DOS)    
    
            /* Warning:
             * Only append encodings.
             * Changing sort order of encodings may result in malfunctioning.
             */ 
        };
    
        public static string RemoveDiacritics(this string inputString)
        {
            if (string.IsNullOrEmpty(inputString))
            {
                return inputString;
            }
    
            var resultStringBuilder = new StringBuilder();
    
            foreach (char currentChar in inputString)
            {
                if (charactersToSkip.Contains(currentChar) || specialCharsToSkip.Contains(currentChar) || ambiguousCharsToSkip.Contains(currentChar))
                {
                    resultStringBuilder.Append(currentChar);
                    continue;
                }
    
                string normalizedString = currentChar.ToString().Normalize(NormalizationForm.FormD);
                foreach (char normalizedChar in normalizedString)
                {
                    if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(normalizedChar) != System.Globalization.UnicodeCategory.NonSpacingMark)
                    {
                        string convertedString = normalizedChar.ToString();
                        char[] convertedChars = null;
    
                        foreach (int encodingCode in encodingsToRemoveDiacritics)
                        {
                            convertedChars = ConvertUnicodeStringToSpecificEncoding(convertedString, encodingCode);
    
                            if (convertedChars.Contains('?') == false)
                            {
                                convertedString = new string(convertedChars);
                            }
                        }
    
                        resultStringBuilder.Append(convertedString);
                    }
                }
            }
    
            return resultStringBuilder.ToString();
        }
    

    which produces the following outputs:

    "abcdefghijklmnopqrstuvwxzy" -> "abcdefghijklmnopqrstuvwxzy"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ" -> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "1234567890" -> "1234567890"
    "ß" -> "ß"
    "ÄÖÜ" -> "ÄÖÜ"
    "äöü" -> "äöü"
    "!\"§$%&/()=?" -> "!\"§$%&/()=?"
    "+-_~'*#" -> "+-_~'*#"
    ",.;:" -> ",.;:"
    "µ" -> "u" // µ (micro sign) -> u
    "<>|" -> "<>|"
    "´`^°" -> "´`^°"
    "²" -> "2" // ² -> 2
    "³" -> "3" // ³ -> 3
    "{}" -> "{}"
    "[]" -> "[]"
    "\\" -> "\\"
    "áàâã" -> "aaaa"
    "ÁÀÂÅ" -> "AAAA"
    "éèêę" -> "eeee"
    "ÉÈÊĚ" -> "EEEE"
    "íìîï" -> "iiii"
    "ÍÌÎ" -> "III"
    "óòôõ" -> "oooo"
    "ÓÒÔŌ" -> "OOOO"
    "úùû" -> "uuu"
    "ÚÙÛ" -> "UUU"
    "ÇĆĈČĊ" -> "CCCCC"
    "çćĉčċ" -> "ccccc"
    "Ñ" -> "N"
    "Æ" -> "A"
    "æ" -> "a"
    "ýÿ" -> "yy"
    "ĹĻĽ" -> "LLL"
    "Ð" -> "D"
    "đ" -> "d"
    "ð" -> "d"
    
  • 2021-01-04 02:02

    "D with stroke" (Wikipedia) is used in several languages and appears to be considered a distinct letter in all of them, which is why it remains unchanged.

  • 2021-01-04 02:14

    The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).

    "đ".Normalize(NormalizationForm.FormD) simply returns "đ", which is not stripped out by the loop because it is not a non-spacing mark.
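
    One way to see this for yourself is to print the code points that FormD produces. The following is a minimal sketch; `DecompositionProbe` and `DescribeDecomposition` are hypothetical names introduced here for illustration, and the helper relies only on `string.Normalize` and `CharUnicodeInfo.GetUnicodeCategory`.

    ```csharp
    using System;
    using System.Globalization;
    using System.Linq;
    using System.Text;

    public static class DecompositionProbe
    {
        // Hypothetical helper: lists every UTF-16 code unit of the FormD
        // normalization of the input, together with its Unicode category.
        public static string DescribeDecomposition(string input)
        {
            string decomposed = input.Normalize(NormalizationForm.FormD);
            return string.Join(" ", decomposed.Select(c =>
                $"U+{(int)c:X4}:{CharUnicodeInfo.GetUnicodeCategory(c)}"));
        }
    }
    ```

    Calling `DecompositionProbe.DescribeDecomposition("\u00E9")` (é) yields `U+0065:LowercaseLetter U+0301:NonSpacingMark`, so the loop has a combining mark to drop. Calling it with `"\u0111"` (đ) yields just `U+0111:LowercaseLetter`, so there is nothing to strip.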

    A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
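
    If a full transliteration library is more than you need, one possible workaround is to keep the FormD loop and add an explicit lookup for the letters that have no decomposition mapping. This is only a sketch under assumptions: the `AsciiFolding` class, the `Fold` method and the contents of the table are illustrative choices made here, not something Unicode or .NET prescribes, and the table is deliberately incomplete.

    ```csharp
    using System.Collections.Generic;
    using System.Globalization;
    using System.Text;

    public static class AsciiFolding
    {
        // Illustrative, incomplete table of letters that FormD leaves untouched.
        // The chosen replacements are assumptions; adjust them for your data.
        private static readonly Dictionary<char, string> Replacements =
            new Dictionary<char, string>
            {
                ['\u0110'] = "D",  // Đ (D with stroke)
                ['\u0111'] = "d",  // đ
                ['\u00D0'] = "D",  // Ð (Eth)
                ['\u00F0'] = "d",  // ð
                ['\u00D8'] = "O",  // Ø
                ['\u00F8'] = "o",  // ø
                ['\u00C6'] = "AE", // Æ
                ['\u00E6'] = "ae", // æ
                ['\u00DF'] = "ss", // ß
            };

        public static string Fold(string input)
        {
            if (string.IsNullOrEmpty(input))
            {
                return input;
            }

            string decomposed = input.Normalize(NormalizationForm.FormD);
            var sb = new StringBuilder(decomposed.Length);

            foreach (char c in decomposed)
            {
                // Drop the combining marks that FormD split off.
                if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                {
                    continue;
                }

                // Substitute letters with no decomposition; keep everything else.
                if (Replacements.TryGetValue(c, out string replacement))
                {
                    sb.Append(replacement);
                }
                else
                {
                    sb.Append(c);
                }
            }

            return sb.ToString();
        }
    }
    ```

    With this table, `AsciiFolding.Fold("\u0110or\u0111e")` (Đorđe) returns `"Dorde"` and `AsciiFolding.Fold("Stra\u00DFe")` (Straße) returns `"Strasse"`. Characters outside the table, including non-Latin scripts, are passed through unchanged.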
