How do I remove diacritics (accents) from a string in .NET?

南方客 2020-11-21 05:44

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the lett

20 answers
  •  广开言路
    2020-11-21 06:00

    Apparently I don't have enough reputation to comment on Alexander's excellent link. Lucene appears to be the only solution that works in reasonably generic cases.

    For those wanting a simple copy-paste solution, here it is, leveraging code in Lucene:

    string testbed = "ÁÂÄÅÇÉÍÎÓÖØÚÜÞàáâãäåæçèéêëìíîïðñóôöøúüāăčĐęğıŁłńŌōřŞşšźžșțệủ";

    Console.WriteLine(Lucene.latinizeLucene(testbed));

    // Output: AAAACEIIOOOUUTHaaaaaaaeceeeeiiiidnoooouuaacDegiLlnOorSsszzsteu

    //////////

    public static class Lucene
    {
        // source: https://raw.githubusercontent.com/apache/lucenenet/master/src/Lucene.Net.Analysis.Common/Analysis/Miscellaneous/ASCIIFoldingFilter.cs
        // idea: https://stackoverflow.com/questions/249087/how-do-i-remove-diacritics-accents-from-a-string-in-net (scroll down, search for lucene by Alexander)
        public static string latinizeLucene(string arg)
        {
            char[] argChar = arg.ToCharArray();
    
            // latinizeLuceneImpl can expand one char up to four chars - e.g. Þ to TH, or æ to ae, or in fact ⑽ to (10)
            char[] resultChar = new char[arg.Length * 4];
    
            int outputPos = Lucene.latinizeLuceneImpl(argChar, 0, ref resultChar, 0, arg.Length);
    
            // Keep only the characters that were actually written to the buffer
            return new string(resultChar, 0, outputPos);
        }
    
        /// <summary>
        /// Converts characters above ASCII to their ASCII equivalents.  For example,
        /// accents are removed from accented characters.
        /// <para/>
        /// @lucene.internal
        /// </summary>
        /// <param name="input">The characters to fold</param>
        /// <param name="inputPos">Index of the first character to fold</param>
        /// <param name="output">The result of the folding. Should be of size >= length * 4.</param>
        /// <param name="outputPos">Index of output where to put the result of the folding</param>
        /// <param name="length">The number of characters to fold</param>
        /// <returns>length of output</returns>
        private static int latinizeLuceneImpl(char[] input, int inputPos, ref char[] output, int outputPos, int length)
        {
            int end = inputPos + length;
            for (int pos = inputPos; pos < end; ++pos)
            {
                char c = input[pos];
    
                // Quick test: if it's not in range then just keep current character
                if (c < '\u0080')
                {
                    output[outputPos++] = c;
                }
                else
                {
                    switch (c)
                    {
                        case '\u00C0': // À  [LATIN CAPITAL LETTER A WITH GRAVE]
                        case '\u00C1': // Á  [LATIN CAPITAL LETTER A WITH ACUTE]
                        case '\u00C2': // Â  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX]
                        case '\u00C3': // Ã  [LATIN CAPITAL LETTER A WITH TILDE]
                        case '\u00C4': // Ä  [LATIN CAPITAL LETTER A WITH DIAERESIS]
                        case '\u00C5': // Å  [LATIN CAPITAL LETTER A WITH RING ABOVE]
                        case '\u0100': // Ā  [LATIN CAPITAL LETTER A WITH MACRON]
                        case '\u0102': // Ă  [LATIN CAPITAL LETTER A WITH BREVE]
                        case '\u0104': // Ą  [LATIN CAPITAL LETTER A WITH OGONEK]
                        case '\u018F': // Ə  http://en.wikipedia.org/wiki/Schwa  [LATIN CAPITAL LETTER SCHWA]
                        case '\u01CD': // Ǎ  [LATIN CAPITAL LETTER A WITH CARON]
                        case '\u01DE': // Ǟ  [LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON]
                        case '\u01E0': // Ǡ  [LATIN CAPITAL LETTER A WITH DOT ABOVE AND MACRON]
                        case '\u01FA': // Ǻ  [LATIN CAPITAL LETTER A WITH RING ABOVE AND ACUTE]
                        case '\u0200': // Ȁ  [LATIN CAPITAL LETTER A WITH DOUBLE GRAVE]
                        case '\u0202': // Ȃ  [LATIN CAPITAL LETTER A WITH INVERTED BREVE]
                        case '\u0226': // Ȧ  [LATIN CAPITAL LETTER A WITH DOT ABOVE]
                        case '\u023A': // Ⱥ  [LATIN CAPITAL LETTER A WITH STROKE]
                        case '\u1D00': // ᴀ  [LATIN LETTER SMALL CAPITAL A]
                        case '\u1E00': // Ḁ  [LATIN CAPITAL LETTER A WITH RING BELOW]
                        case '\u1EA0': // Ạ  [LATIN CAPITAL LETTER A WITH DOT BELOW]
                        case '\u1EA2': // Ả  [LATIN CAPITAL LETTER A WITH HOOK ABOVE]
                        case '\u1EA4': // Ấ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND ACUTE]
                        case '\u1EA6': // Ầ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND GRAVE]
                        case '\u1EA8': // Ẩ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                        case '\u1EAA': // Ẫ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND TILDE]
                        case '\u1EAC': // Ậ  [LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                        case '\u1EAE': // Ắ  [LATIN CAPITAL LETTER A WITH BREVE AND ACUTE]
                        case '\u1EB0': // Ằ  [LATIN CAPITAL LETTER A WITH BREVE AND GRAVE]
                        case '\u1EB2': // Ẳ  [LATIN CAPITAL LETTER A WITH BREVE AND HOOK ABOVE]
                        case '\u1EB4': // Ẵ  [LATIN CAPITAL LETTER A WITH BREVE AND TILDE]
                        case '\u1EB6': // Ặ  [LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW]
                        case '\u24B6': // Ⓐ  [CIRCLED LATIN CAPITAL LETTER A]
                        case '\uFF21': // Ａ  [FULLWIDTH LATIN CAPITAL LETTER A]
                            output[outputPos++] = 'A';
                            break;
                        case '\u00E0': // à  [LATIN SMALL LETTER A WITH GRAVE]
                        case '\u00E1': // á  [LATIN SMALL LETTER A WITH ACUTE]
                        case '\u00E2': // â  [LATIN SMALL LETTER A WITH CIRCUMFLEX]
                        case '\u00E3': // ã  [LATIN SMALL LETTER A WITH TILDE]
                        case '\u00E4': // ä  [LATIN SMALL LETTER A WITH DIAERESIS]
                        case '\u00E5': // å  [LATIN SMALL LETTER A WITH RING ABOVE]
                        case '\u0101': // ā  [LATIN SMALL LETTER A WITH MACRON]
                        case '\u0103': // ă  [LATIN SMALL LETTER A WITH BREVE]
                        case '\u0105': // ą  [LATIN SMALL LETTER A WITH OGONEK]
                        case '\u01CE': // ǎ  [LATIN SMALL LETTER A WITH CARON]
                        case '\u01DF': // ǟ  [LATIN SMALL LETTER A WITH DIAERESIS AND MACRON]
                        case '\u01E1': // ǡ  [LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON]
                        case '\u01FB': // ǻ  [LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE]
                        case '\u0201': // ȁ  [LATIN SMALL LETTER A WITH DOUBLE GRAVE]
                        case '\u0203': // ȃ  [LATIN SMALL LETTER A WITH INVERTED BREVE]
                        case '\u0227': // ȧ  [LATIN SMALL LETTER A WITH DOT ABOVE]
                        case '\u0250': // ɐ  [LATIN SMALL LETTER TURNED A]
                        case '\u0259': // ə  [LATIN SMALL LETTER SCHWA]
                        case '\u025A': // ɚ  [LATIN SMALL LETTER SCHWA WITH HOOK]
                        case '\u1D8F': // ᶏ  [LATIN SMALL LETTER A WITH RETROFLEX HOOK]
                        case '\u1D95': // ᶕ  [LATIN SMALL LETTER SCHWA WITH RETROFLEX HOOK]
                        case '\u1E01': // ḁ  [LATIN SMALL LETTER A WITH RING BELOW]
                        case '\u1E9A': // ẚ  [LATIN SMALL LETTER A WITH RIGHT HALF RING]
                        case '\u1EA1': // ạ  [LATIN SMALL LETTER A WITH DOT BELOW]
                        case '\u1EA3': // ả  [LATIN SMALL LETTER A WITH HOOK ABOVE]
                        case '\u1EA5': // ấ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE]
                        case '\u1EA7': // ầ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE]
                        case '\u1EA9': // ẩ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE]
                        case '\u1EAB': // ẫ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE]
                        case '\u1EAD': // ậ  [LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW]
                        case '\u1EAF': // ắ  [LATIN SMALL LETTER A WITH BREVE AND ACUTE]
                        case '\u1EB1': // ằ  [LATIN SMALL LETTER A WITH BREVE AND GRAVE]
                        case '\u1EB3': // ẳ  [LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE]
                        case '\u1EB5': // ẵ  [LATIN SMALL LETTER A WITH BREVE AND TILDE]
                        case '\u1EB7': // ặ  [LATIN SMALL LETTER A WITH BREVE AND DOT BELOW]
                        case '\u2090': // ₐ  [LATIN SUBSCRIPT SMALL LETTER A]
                        case '\u2094': // ₔ  [LATIN SUBSCRIPT SMALL LETTER SCHWA]
                        case '\u24D0': // ⓐ  [CIRCLED LATIN SMALL LETTER A]
                        case '\u2C65': // ⱥ  [LATIN SMALL LETTER A WITH STROKE]
                        case '\u2C6F': // Ɐ  [LATIN CAPITAL LETTER TURNED A]
                        case '\uFF41': // ａ  [FULLWIDTH LATIN SMALL LETTER A]
                            output[outputPos++] = 'a';
                            break;
                        case '\uA732': // Ꜳ  [LATIN CAPITAL LETTER AA]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'A';
                            break;
                        case '\u00C6': // Æ  [LATIN CAPITAL LETTER AE]
                        case '\u01E2': // Ǣ  [LATIN CAPITAL LETTER AE WITH MACRON]
                        case '\u01FC': // Ǽ  [LATIN CAPITAL LETTER AE WITH ACUTE]
                        case '\u1D01': // ᴁ  [LATIN LETTER SMALL CAPITAL AE]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'E';
                            break;
                        case '\uA734': // Ꜵ  [LATIN CAPITAL LETTER AO]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'O';
                            break;
                        case '\uA736': // Ꜷ  [LATIN CAPITAL LETTER AU]
                            output[outputPos++] = 'A';
                            output[outputPos++] = 'U';
                            break;
    
                        // etc. etc. etc.
                        // See the source link above for the complete code.
                        //
                        // Unfortunately, post length is limited, as in
                        // "Body is limited to 30000 characters; you entered 136098."
                        //
                        // [...]
    
                        case '\u2053': // ⁓  [SWUNG DASH]
                        case '\uFF5E': // ～  [FULLWIDTH TILDE]
                            output[outputPos++] = '~';
                            break;
                        default:
                            output[outputPos++] = c;
                            break;
                    }
                }
            }
            return outputPos;
        }
    }
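
    For contrast, here is a minimal sketch of the decomposition-based approach that is often suggested for this problem: normalize to Unicode form D, then drop the combining marks. It is not part of the Lucene code above, and `DiacriticsSketch` and `StripCombiningMarks` are illustrative names only. It handles letters that decompose into a base letter plus marks (é, ç, ệ), but it leaves letters with no canonical decomposition (Ø, Þ, æ, Đ, ł) untouched, which is the gap the explicit table above covers.

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    public static class DiacriticsSketch
    {
        // Hypothetical helper (not from Lucene): removes combining marks
        // after canonical decomposition, then recomposes what is left.
        public static string StripCombiningMarks(string text)
        {
            // FormD splits e.g. U+00E9 into 'e' followed by U+0301 (combining acute).
            string decomposed = text.Normalize(NormalizationForm.FormD);
            var sb = new StringBuilder(decomposed.Length);

            foreach (char c in decomposed)
            {
                // Keep everything except non-spacing (combining) marks.
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(c);
                }
            }

            return sb.ToString().Normalize(NormalizationForm.FormC);
        }

        public static void Main()
        {
            // Decomposable letters lose their marks: returns "ete".
            Console.WriteLine(StripCombiningMarks("\u00E9t\u00E9"));

            // U+00D8 (O with stroke) and U+00E6 (ae ligature) have no
            // canonical decomposition, so they come back unchanged.
            Console.WriteLine(StripCombiningMarks("\u00D8\u00E6") == "\u00D8\u00E6");
        }
    }
    ```

    If only decomposable accents matter (as with most French text), this shorter approach may be enough; the Lucene table is the safer choice when inputs can contain letters such as Ø, Þ, or ł.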
    
