How do I remove diacritics (accents) from a string in .NET?

南方客 2020-11-21 05:44

I'm trying to convert some strings that are in French Canadian and basically, I'd like to be able to take out the French accent marks in the letters while keeping the lett

20 answers
  •  鱼传尺愫
    2020-11-21 05:56

    It's funny that such a question can get so many answers, and yet none of them fits my requirements :) With so many languages around, a fully language-agnostic solution is, AFAIK, not really possible; as others have mentioned, the FormC and FormD normalization approaches have their own issues.
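    For reference, the FormD approach usually looks roughly like the sketch below. This is not my own code from this answer; the class and method names (`DiacriticsSketch`, `RemoveDiacriticsViaFormD`) are made up for illustration. It decomposes each accented letter into a base letter plus combining marks and then drops the marks.

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticsSketch
{
    // Hypothetical helper name, used only for this illustration.
    public static string RemoveDiacriticsViaFormD(this string str)
    {
        // FormD splits e.g. 'é' into 'e' followed by U+0301 (combining acute accent).
        string decomposed = str.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            // Skip the combining marks, keep every other character.
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

    With this, "été" becomes "ete", but "möchte" becomes "mochte" rather than "moechte", and 'ß' is left untouched because it has no decomposition. That is the kind of language-specific gap I mean.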

    Since the original question was related to French, the simplest working answer is indeed

        public static string ConvertWesternEuropeanToASCII(this string str)
        {
            return Encoding.ASCII.GetString(Encoding.GetEncoding(1251).GetBytes(str));
        }
    

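    1251 is the Windows Cyrillic code page. It contains no accented Latin letters, so on .NET Framework the encoder's best-fit fallback maps each of them to its unaccented base letter, and the resulting bytes decode cleanly as ASCII. A code page that does contain the accented letters (for example 1252) will not work here, because bytes above 0x7F decode to '?' in ASCII.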

    This, however, only replaces one character with one character. Since I am also working with German input, I wrote a manual conversion:

        public static string LatinizeGermanCharacters(this string str)
        {
            StringBuilder sb = new StringBuilder(str.Length);
            foreach (char c in str)
            {
                switch (c)
                {
                    case 'ä':
                        sb.Append("ae");
                        break;
                    case 'ö':
                        sb.Append("oe");
                        break;
                    case 'ü':
                        sb.Append("ue");
                        break;
                    case 'Ä':
                        sb.Append("Ae");
                        break;
                    case 'Ö':
                        sb.Append("Oe");
                        break;
                    case 'Ü':
                        sb.Append("Ue");
                        break;
                    case 'ß':
                        sb.Append("ss");
                        break;
                    default:
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString();
        }
    

    It might not deliver the best performance, but at least it is very easy to read and extend. Regex is a no-go here: it is much slower than plain char/string operations.
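    If the switch grows too long, the same idea can be driven by a lookup table. The following is only a sketch under my own assumptions: the names `LatinizeSketch`, `Replacements` and `LatinizeWithTable` are hypothetical, and the entries for 'æ' and 'œ' are extra examples that are not part of the method above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class LatinizeSketch
{
    // Hypothetical replacement table; extend it by adding entries.
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>
    {
        { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" },
        { 'Ä', "Ae" }, { 'Ö', "Oe" }, { 'Ü', "Ue" },
        { 'ß', "ss" },
        // Illustrative additions, not in the original switch:
        { 'æ', "ae" }, { 'œ', "oe" }, { 'Æ', "Ae" }, { 'Œ', "Oe" }
    };

    public static string LatinizeWithTable(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            // Append the multi-character replacement if one exists, else the char itself.
            if (Replacements.TryGetValue(c, out var replacement))
            {
                sb.Append(replacement);
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

    Adding a new language then only means adding entries to the table; the loop itself stays unchanged.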

    I also have a very simple method to remove spaces:

        public static string RemoveSpace(this string str)
        {
            return str.Replace(" ", string.Empty);
        }
    

    Finally, I use a combination of all three extensions above:

        public static string LatinizeAndConvertToASCII(this string str, bool keepSpace = false)
        {
            // Latinize first: the ASCII step would otherwise turn 'ö' into "o" instead of "oe".
            str = str.LatinizeGermanCharacters().ConvertWesternEuropeanToASCII();
            return keepSpace ? str : str.RemoveSpace();
        }
    

    And a small (not exhaustive) unit test for it, which passes:

        [TestMethod()]
        public void LatinizeAndConvertToASCIITest()
        {
            string europeanStr = "Bonjour ça va? C'est l'été! Ich möchte ä Ä á à â ê é è ë Ë É ï Ï î í ì ó ò ô ö Ö Ü ü ù ú û Û ý Ý ç Ç ñ Ñ";
            string expected = "Bonjourcava?C'estl'ete!IchmoechteaeAeaaaeeeeEEiIiiiooooeOeUeueuuuUyYcCnN";
            string actual = europeanStr.LatinizeAndConvertToASCII();
            Assert.AreEqual(expected, actual);
        }
    
