Why doesn't Đ get flattened to D when Removing Accents/Diacritics

后端未结

关注

 5  1889

I\'m using this method to remove accents from my strings:

static string RemoveAccents(string input)
{
    string normalized = input.Normalize(NormalizationFo


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2021-01-04 01:48
              
            
            
                                                                       
this should work

    private static String RemoveDiacritics(string text)
    {
        String normalized = text.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < normalized.Length; i++)
        {
            Char c = normalized[i];
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }

        return sb.ToString();
    }

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  说谎        
                
              
                            
                2021-01-04 02:00
              
            
            
                                                                       
I have to admit that I'm not sure why this works but it sure seems to

var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str)); 


=> "aoaaaaalccceeeeiiddnnooooruuuuyt"
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  傲寒        
                
              
                            
                2021-01-04 02:01
              
            
            
                                                                       
string.Normalize(NormalizationForm) is an easy way to remove 'real' diacricits (Wiki) but many letters you may want to convert are not affected by this.
I had simmilar problems with Ð & ð (letter Eth), đ, Æ & æ. To convert them into ANSI (Latin) use Unicode-conversion instead!
    private static char[] ConvertUnicodeStringToSpecificEncoding(string input, int resultEncodingCode)
    {
        System.Text.Encoding unicodeEncoding = System.Text.Encoding.Unicode;
        System.Text.Encoding specificEncoding = System.Text.Encoding.GetEncoding(resultEncodingCode);

        byte[] convertedBytes = System.Text.Encoding.Convert(unicodeEncoding, specificEncoding, unicodeEncoding.GetBytes(input));
        char[] convertedChars = new char[specificEncoding.GetCharCount(convertedBytes, 0, convertedBytes.Length)];
        specificEncoding.GetChars(convertedBytes, 0, convertedBytes.Length, convertedChars, 0);
        return convertedChars;
    }

Call this method with multiple encoding on the same string to create an intersection on the letters you want to have left.
List of encodings:
https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8
My solution looks like this
    // Encoding Types (int Codes) https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=netframework-4.8
    private static readonly char[] charactersToSkip = new char[] { 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü' };
    private static readonly char[] specialCharsToSkip = new char[] { '^', '´', '`', '°', '!', '\'', '§', '$', '%', '&', '/', '(', ')', '=', '{', '[', ']', '}', '\\', '+', '-' };
    private static readonly char[] ambiguousCharsToSkip = new char[] { '?' };   // Chars which might be a result of encoding-conversion and have to be skipped beforehand.
    private static readonly int[] encodingsToRemoveDiacritics = new int[]
    {
        852,    // 852  ibm852  Central European (DOS)
        850,    // 850  ibm850  Western European (DOS)
        860,    // 860  IBM860  Portuguese (DOS)    

        /* Warning:
         * Only append encodings.
         * Changing sort order of encodings may result in malfunctioning.
         */ 
    };

    public static string RemoveDiacritics(this string inputString)
    {
        if (string.IsNullOrEmpty(inputString))
        {
            return inputString;
        }

        var resultStringBuilder = new StringBuilder();

        foreach (char currentChar in inputString)
        {
            if (charactersToSkip.Contains(currentChar) || specialCharsToSkip.Contains(currentChar) || ambiguousCharsToSkip.Contains(currentChar))
            {
                resultStringBuilder.Append(currentChar);
                continue;
            }

            string normalizedString = currentChar.ToString().Normalize(NormalizationForm.FormD);
            foreach (char normalizedChar in normalizedString)
            {
                if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(normalizedChar) != System.Globalization.UnicodeCategory.NonSpacingMark)
                {
                    string convertedString = normalizedChar.ToString();
                    char[] convertedChars = null;

                    foreach (int encodingCode in encodingsToRemoveDiacritics)
                    {
                        convertedChars = ConvertUnicodeStringToSpecificEncoding(convertedString, encodingCode);

                        if (convertedChars.Contains('?') == false)
                        {
                            convertedString = new string(convertedChars);
                        }
                    }

                    resultStringBuilder.Append(convertedString);
                }
            }
        }

        return resultStringBuilder.ToString();
    }

which creates following outputs
"abcdefghijklmnopqrstuvwxzy" -> "abcdefghijklmnopqrstuvwxzy"
"ABCDEFGHIJKLMNOPQRSTUVWXYZ" -> "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
"1234567890" -> "1234567890"
"ß" -> "ß"
"ÄÖÜ" -> "ÄÖÜ"
"äöü" -> "äöü"
"!\"§$%&/()=?" -> "!\"§$%&/()=?"
"+-_~'*#" -> "+-_~'*#"
",.;:" -> ",.;:"
"µ" -> "u" // My -> u
"<>|" -> "<>|"
"´`^°" -> "´`^°"
"²" -> "2" // ² -> 2
"³" -> "3" // ³ -> 3
"{}" -> "{}"
"[]" -> "[]"
"\\" -> "\\"
"áàâã" -> "aaaa"
"ÁÀÂÅ" -> "AAAA"
"éèêę" -> "eeee"
"ÉÈÊĚ" -> "EEEE"
"íìîï" -> "iiii"
"ÍÌÎ" -> "III"
"óòôõ" -> "oooo"
"ÓÒÔŌ" -> "OOOO"
"úùû" -> "uuu"
"ÚÙÛ" -> "UUU"
"ÇĆĈČĊ" -> "CCCCC"
"çćĉčċ" -> "ccccc"
"Ñ" -> "N"
"Æ" -> "A"
"æ" -> "a"
"ýÿ" -> "yy"
"ĹĻĽ" -> "LLL"
"Ð" -> "D"
"đ" -> "d"
"ð" -> "d"

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  礼貌的吻别        
                
              
                            
                2021-01-04 02:02
              
            
            
                                                                       
"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  佛祖请我去吃肉        
                
              
                            
                2021-01-04 02:14
              
            
            
                                                                       
The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).

"đ".Normalize(NormalizationForm.FormD) simply returns "đ", which is not stripped out by the loop because it is not a non-spacing mark.

A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this  approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复