How do I remove diacritics (accents) from a string in .NET?

南方客 2020-11-21 05:44

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the letter.

20 answers
  • 2020-11-21 05:52

    I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

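    // Requires: using System.Globalization; using System.Text;
    static string RemoveDiacritics(string text)
    {
        // Form D splits each accented character into a base character plus combining marks.
        var normalizedString = text.Normalize(NormalizationForm.FormD);
        var stringBuilder = new StringBuilder();
    
        foreach (var c in normalizedString)
        {
            // Keep everything except the combining (non-spacing) marks.
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (unicodeCategory != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }
    
        // Recompose whatever is left into form C.
        return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
    }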
    

    Note that this is a followup to his earlier post: Stripping diacritics....

    The approach uses String.Normalize to decompose the input string into its constituent characters (basically separating the "base" characters from the combining diacritics), then scans the result and retains only the base characters. It's a little complicated, but really you're looking at a complicated problem.
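
    To make the decomposition step concrete, here is a minimal Java sketch (Java rather than C#, in line with the Java answer further down this thread) that lists the code points a string turns into under NFD. The class and method names are made up for illustration; java.text.Normalizer is the only library API involved.

```java
import java.text.Normalizer;

// Hypothetical demo class; names are for illustration only.
final class DecompositionDemo {
    // Returns the code points of the NFD form of the input, space-separated.
    static String codePointsOfNfd(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) decomposes into U+0065 (e)
        // followed by U+0301 (combining acute accent).
        System.out.println(codePointsOfNfd("\u00E9")); // prints "U+0065 U+0301"
    }
}
```

    The second code point is the non-spacing mark that the filtering loop drops. In .NET the equivalent decomposition would go through String.Normalize(NormalizationForm.FormD), as in the method above.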

    Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
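
    For comparison, a table-based approach for French might look roughly like the following Java sketch. The table is an assumption about which accented letters matter for French, not the table from the linked C++ question, and it deliberately ignores ligatures such as œ and æ, which need a one-to-many replacement. The class name FrenchAccentTable and the method strip are hypothetical.

```java
// Hypothetical table-based stripper; the letter list is an assumption.
final class FrenchAccentTable {
    // Lower-case accented letters, written as escapes so the source stays ASCII.
    private static final String ACCENTED =
        "\u00E0\u00E2\u00E4"          // a: grave, circumflex, diaeresis
        + "\u00E7"                    // c: cedilla
        + "\u00E8\u00E9\u00EA\u00EB"  // e: grave, acute, circumflex, diaeresis
        + "\u00EE\u00EF"              // i: circumflex, diaeresis
        + "\u00F4\u00F6"              // o: circumflex, diaeresis
        + "\u00F9\u00FB\u00FC"        // u: grave, circumflex, diaeresis
        + "\u00FF";                   // y: diaeresis
    // Replacement for each entry of ACCENTED, position by position.
    private static final String PLAIN = "aaaceeeeiioouuuy";

    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Look up the lower-case form so one table covers both cases.
            int index = ACCENTED.indexOf(Character.toLowerCase(c));
            if (index < 0) {
                out.append(c); // not in the table: keep unchanged
            } else {
                char plain = PLAIN.charAt(index);
                out.append(Character.isUpperCase(c) ? Character.toUpperCase(plain) : plain);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("Qu\u00E9bec, \u00E7a va?")); // prints "Quebec, ca va?"
    }
}
```

    The trade-off is that anything missing from the table passes through untouched, so this only makes sense when the set of input letters is known in advance.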

  • 2020-11-21 05:55

    In case anyone's interested, here is the Java equivalent:

    import java.text.Normalizer;
    
    public class MyClass
    {
        public static String removeDiacritics(String input)
        {
            String nrml = Normalizer.normalize(input, Normalizer.Form.NFD);
            StringBuilder stripped = new StringBuilder();
            for (int i=0;i<nrml.length();++i)
            {
                if (Character.getType(nrml.charAt(i)) != Character.NON_SPACING_MARK)
                {
                    stripped.append(nrml.charAt(i));
                }
            }
            return stripped.toString();
        }
    }
    
  • 2020-11-21 05:56

    It's funny that such a question can get so many answers, and yet none fit my requirements :) With so many languages around, a fully language-agnostic solution is, AFAIK, not really possible; as others have mentioned, FormC and FormD both give issues.

    Since the original question was related to French, the simplest working answer is indeed:

        public static string ConvertWesternEuropeanToASCII(this string str)
        {
            return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
        }
    

    1251 should be replaced by the encoding code of the input language.

    This, however, only replaces one character with one character. Since I also work with German input, I wrote a manual conversion:

        public static string LatinizeGermanCharacters(this string str)
        {
            StringBuilder sb = new StringBuilder(str.Length);
            foreach (char c in str)
            {
                switch (c)
                {
                    case 'ä':
                        sb.Append("ae");
                        break;
                    case 'ö':
                        sb.Append("oe");
                        break;
                    case 'ü':
                        sb.Append("ue");
                        break;
                    case 'Ä':
                        sb.Append("Ae");
                        break;
                    case 'Ö':
                        sb.Append("Oe");
                        break;
                    case 'Ü':
                        sb.Append("Ue");
                        break;
                    case 'ß':
                        sb.Append("ss");
                        break;
                    default:
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString();
        }
    

    It might not deliver the best performance, but at least it is very easy to read and extend. Regex was a no-go for me: in my experience it is much slower than plain char/string operations.
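
    The same mapping can also be kept in a lookup table, so that extending it means adding an entry rather than another case. A minimal sketch in Java (the class name GermanLatinizer and the method latinize are hypothetical, and Map.of assumes Java 9 or later):

```java
import java.util.Map;

// Hypothetical table-driven variant of the switch-based conversion.
final class GermanLatinizer {
    // German letters and their conventional multi-letter transliterations.
    private static final Map<Character, String> REPLACEMENTS = Map.of(
        '\u00E4', "ae",  // a-umlaut
        '\u00F6', "oe",  // o-umlaut
        '\u00FC', "ue",  // u-umlaut
        '\u00C4', "Ae",  // A-umlaut
        '\u00D6', "Oe",  // O-umlaut
        '\u00DC', "Ue",  // U-umlaut
        '\u00DF', "ss"   // sharp s
    );

    static String latinize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c); // no entry: keep the character unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(latinize("Ich m\u00F6chte")); // prints "Ich moechte"
    }
}
```

    Whether the map lookup is faster or slower than the switch was not measured here; the point is only that the table is data and can be swapped per language.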

    I also have a very simple method to remove spaces:

        public static string RemoveSpace(this string str)
        {
            return str.Replace(" ", string.Empty);
        }
    

    In the end, I use a combination of the three extensions above:

        public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
        {
            str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();            
            return keepSpace ? str : str.RemoveSpace();
        }
    

    And a small (non-exhaustive) unit test for it, which passes:

        [TestMethod()]
        public void LatinizeAndConvertToASCIITest()
        {
            string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
            string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
            string actual = europeanStr.LatinizeAndConvertToASCII();
            Assert.AreEqual(expected, actual);
        }
    
  • 2020-11-21 05:57

    I often use an extension method based on another version I found here (see Replacing characters in C# (ascii)). A quick explanation:

    • Normalizing to form D splits characters like è into an e and a non-spacing grave accent (`)
    • From this, the non-spacing characters are removed
    • The result is normalized back to form C (I'm not sure if this is necessary)

    Code:

    using System.Linq;
    using System.Text;
    using System.Globalization;
    
    // namespace here
    public static class Utility
    {
        public static string RemoveDiacritics(this string str)
        {
            if (null == str) return null;
            var chars =
                from c in str.Normalize(NormalizationForm.FormD).ToCharArray()
                let uc = CharUnicodeInfo.GetUnicodeCategory(c)
                where uc != UnicodeCategory.NonSpacingMark
                select c;
    
            var cleanStr = new string(chars.ToArray()).Normalize(NormalizationForm.FormC);
    
            return cleanStr;
        }
    
        // or, alternatively
        public static string RemoveDiacritics2(this string str)
        {
            if (null == str) return null;
            var chars = str
                .Normalize(NormalizationForm.FormD)
                .ToCharArray()
                .Where(c=> CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                .ToArray();
    
            return new string(chars).Normalize(NormalizationForm.FormC);
        }
    }
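
    On whether the final normalization back to form C is necessary: form D also decomposes characters whose parts are not non-spacing marks, precomposed Hangul syllables being one example, so without the last step those stay decomposed. The Java sketch below illustrates the effect. The class and method names are hypothetical, and the C# code above presumably behaves the same way, but this was only reasoned through for Java.

```java
import java.text.Normalizer;

// Hypothetical demo of why recomposing to NFC after stripping can matter.
final class RecomposeDemo {
    // Strips non-spacing marks from the NFD form; optionally recomposes to NFC.
    static String strip(String input, boolean recompose) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        StringBuilder kept = new StringBuilder();
        decomposed.codePoints()
            .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
            .forEach(kept::appendCodePoint);
        return recompose
            ? Normalizer.normalize(kept, Normalizer.Form.NFC)
            : kept.toString();
    }

    public static void main(String[] args) {
        String hangul = "\uD55C"; // one precomposed Hangul syllable (U+D55C)
        // NFD splits the syllable into three jamo, none of them non-spacing marks.
        System.out.println(strip(hangul, false).length()); // prints 3
        // NFC puts the three jamo back together into the original syllable.
        System.out.println(strip(hangul, true).length());  // prints 1
    }
}
```

    For purely Latin input the last step makes no visible difference, which is probably why it looks optional.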
    
  • 2020-11-21 05:59

    This is how I replace accented characters with their unaccented equivalents in all my .NET programs. Note that the result is also converted to lower case.

    C#:

    // Replaces each accented letter with its unaccented base letter (e.g. 'é' becomes 'e').
    // Note: the result is also converted to lower case.
    public string RemoveDiacritics(string s)
    {
        string normalizedString = s.Normalize(NormalizationForm.FormD);
        StringBuilder stringBuilder = new StringBuilder();
    
        foreach (char c in normalizedString)
        {
            // Keep everything except combining (non-spacing) marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                stringBuilder.Append(c);
            }
        }
    
        return stringBuilder.ToString().ToLower();
    }
    

    VB .NET:

    ' Replaces each accented letter with its unaccented base letter (e.g. "é" becomes "e") and lower-cases the result.
    Public Function RemoveDiacritics(ByVal s As String) As String
        Dim normalizedString As String
        Dim stringBuilder As New StringBuilder
        normalizedString = s.Normalize(NormalizationForm.FormD)
        Dim i As Integer
        Dim c As Char
    
        For i = 0 To normalizedString.Length - 1
            c = normalizedString(i)
            If CharUnicodeInfo.GetUnicodeCategory(c) <> UnicodeCategory.NonSpacingMark Then
                stringBuilder.Append(c)
            End If
        Next
        Return stringBuilder.ToString().ToLower()
    End Function
    
  • 2020-11-21 05:59

    Imports System.Globalization
    Imports System.Linq
    Imports System.Text
    
    Public Function DECODE(ByVal x As String) As String
        Dim sb As New StringBuilder
        ' Decompose to form D, then keep everything except non-spacing marks.
        For Each c As Char In x.Normalize(NormalizationForm.FormD).Where(Function(a) CharUnicodeInfo.GetUnicodeCategory(a) <> UnicodeCategory.NonSpacingMark)
            sb.Append(c)
        Next
        Return sb.ToString()
    End Function
    