How to remove accents and tilde in a C++ std::string

半阙折子戏 2020-12-15 21:26

I have a problem with a string in C++ which has several words in Spanish. This means that I have a lot of words with accents and tildes. I want to replace them for their not

8 Answers
  • 2020-12-15 21:49
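        // C# (.NET), not C++. Requires: using System.Globalization; using System.Text;
        /// <summary>
        /// 
        /// Replace accented characters with their unaccented ASCII equivalents.
        /// In other words, convert a string to an ASCII-compliant string.
        /// Characters with no canonical decomposition (for example æ or Cyrillic letters) are dropped, not transliterated.
        /// 
        /// This also gets rid of hidden control characters such as NUL and TAB, except \n and \r.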
        /// 
        /// Tests with accents and foreign characters:
        /// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňʼnνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssIJIJijijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
        /// After:  "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
        /// 
        /// Tests with invalid 'special hidden characters':
        /// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
        /// After:  "00000Bjrk'\"\\\n\r oacu'\\'te"
        /// 
        /// </summary>
        private string Normalize(string StringToClean)
        {
            string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
            StringBuilder Buffer = new StringBuilder(StringToClean.Length);
    
            for (int i = 0; i < normalizedString.Length; i++)
            {
                if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
                {
                    Buffer.Append(normalizedString[i]);
                }
            }
    
            string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
            StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);
    
            foreach (char character in PreAsciiCompliant)
            {
                // Reject all control characters except \n (Line-Feed) and \r (Carriage-Return).
                // This drops hidden characters such as NUL and TAB, and any remaining non-ASCII character.
                if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13)) 
                {
                    AsciiComplient.Append(character);
                }
            }
            return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
        }
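Since the question is about C++, here is a minimal sketch of only the second pass of the C# method above (the printable-ASCII filter and the final trim). The name `keep_printable_ascii` is hypothetical. It does not decompose anything: on UTF-8 input, every byte of an accented letter is simply dropped, so the decomposition step still needs a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Keep printable ASCII (32..126) plus '\n' and '\r'; drop every other byte.
// Hypothetical helper; mirrors only the second loop of the C# method.
std::string keep_printable_ascii(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    for (unsigned char c : input) {
        if ((c >= 32 && c < 127) || c == '\n' || c == '\r') {
            out.push_back(static_cast<char>(c));
        }
    }
    // Trim leading and trailing spaces and line breaks, like the final Trim() call.
    const char* ws = " \n\r";
    const std::size_t first = out.find_first_not_of(ws);
    if (first == std::string::npos) return std::string();
    const std::size_t last = out.find_last_not_of(ws);
    return out.substr(first, last - first + 1);
}
```

Without a preceding decomposition pass, the UTF-8 string for "niño" comes out as "nio", not "nino".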
    
  • 2020-12-15 21:52

    I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)

    Now, the best algorithm is hinted at in the approved answer: use NFKD (compatibility decomposition) to decompose accented letters into the base letter and a separate accent, and then remove all accents.

    There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. What's the difference between æ in NFKC and æ in NFKD?
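The "remove all accents" step can be sketched in standard C++, assuming the text has already been decomposed by a Unicode library (the standard library has no normalization function). The names `is_combining_mark` and `strip_marks` are hypothetical, and checking only the U+0300..U+036F block is a simplifying assumption.

```cpp
#include <string>

// True for code points in the Combining Diacritical Marks block (U+0300..U+036F).
// Assumption: this block covers the accents of interest; Unicode defines
// further combining marks in other blocks.
bool is_combining_mark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Remove combining marks from text that is ALREADY decomposed (NFD or NFKD).
std::u32string strip_marks(const std::u32string& decomposed)
{
    std::u32string out;
    out.reserve(decomposed.size());
    for (char32_t cp : decomposed) {
        if (!is_combining_mark(cp)) {
            out.push_back(cp);
        }
    }
    return out;
}
```

A precomposed ñ (U+00F1) passes through unchanged, which is why the decomposition has to come first.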

  • 2020-12-15 21:55

    I could not link the ICU libraries, but I still think it's the best solution. As I need this program to be functional as soon as possible, I made a little program (that I have to improve) and I'm going to use that. Thank you all for the suggestions and answers.

    Here's the code I'm gonna use:

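    for (it = dictionary.begin(); it != dictionary.end(); ++it)
    {
        strMine = it->first;   // accented sequence to look for
        strAux = it->second;   // its unaccented replacement
        found = toReplace.find(strMine);
        while (found != std::string::npos)
        {
            // Erase the whole key; a hard-coded 2 only fits two-byte UTF-8 sequences.
            toReplace.erase(found, strMine.size());
            toReplace.insert(found, strAux);
            // Resume after the inserted text so it is never rescanned.
            found = toReplace.find(strMine, found + strAux.size());
        }
    }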
    

    I will change it next time I have to turn my program in for correction (in about 6 weeks).
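For reference, a self-contained version of the same dictionary approach might look like the sketch below. The names `replace_all` and `kSpanish` are hypothetical, and the table assumes the input is UTF-8; it is written with byte escapes so the example does not depend on the encoding of the source file. It uses each key's own length, not a fixed byte count.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Replace every key of `dictionary` found in `text` with its mapped value.
// Keys are UTF-8 byte sequences, so they may be longer than one char.
std::string replace_all(std::string text,
                        const std::map<std::string, std::string>& dictionary)
{
    for (const auto& entry : dictionary) {
        const std::string& from = entry.first;
        const std::string& to = entry.second;
        if (from.empty()) continue; // an empty key would match at every position
        std::size_t pos = text.find(from);
        while (pos != std::string::npos) {
            text.replace(pos, from.size(), to);
            pos = text.find(from, pos + to.size()); // skip past the replacement
        }
    }
    return text;
}

// Hypothetical table for lowercase Spanish letters (UTF-8 bytes -> ASCII).
const std::map<std::string, std::string> kSpanish = {
    {"\xc3\xa1", "a"}, {"\xc3\xa9", "e"}, {"\xc3\xad", "i"},
    {"\xc3\xb3", "o"}, {"\xc3\xba", "u"}, {"\xc3\xb1", "n"},
    {"\xc3\xbc", "u"},
};
```

Calling `replace_all(input, kSpanish)` returns a copy with the listed letters replaced. Uppercase letters would need their own entries.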

  • 2020-12-15 21:58

    You might want to check out the boost (http://www.boost.org/) library.

    It has a regex library, which you could use. In addition, it has a dedicated string-manipulation library whose functions include replace.

  • 2020-12-15 22:07

    Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).
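A minimal sketch of why wide strings help here: `wchar_t` is 16 bits on Windows and typically 32 bits elsewhere, and the accented Spanish letters all fit in a single element either way, so a per-character mapping is enough. The names `unaccent_char` and `unaccent_text` are hypothetical, and the table covers only a few letters.

```cpp
#include <string>

// Map one wide character to an unaccented ASCII letter.
// Hypothetical helper covering lowercase Spanish letters and Ñ; extend as needed.
wchar_t unaccent_char(wchar_t c)
{
    switch (c) {
        case L'\u00e1': return L'a'; // á
        case L'\u00e9': return L'e'; // é
        case L'\u00ed': return L'i'; // í
        case L'\u00f3': return L'o'; // ó
        case L'\u00fa': return L'u'; // ú
        case L'\u00fc': return L'u'; // ü
        case L'\u00f1': return L'n'; // ñ
        case L'\u00d1': return L'N'; // Ñ
        default:        return c;    // everything else is left untouched
    }
}

// Apply the mapping to every element of a wide string.
std::wstring unaccent_text(std::wstring text)
{
    for (wchar_t& c : text) {
        c = unaccent_char(c);
    }
    return text;
}
```

This only works if the input is precomposed. A base letter followed by a separate combining accent would keep the accent, and converting from UTF-8 into `std::wstring` is a separate step not shown here.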

  • 2020-12-15 22:07

    If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)

    Edit: Sorry, you're right, tr doesn't seem to work. How about sed? It's a pretty stupid script I've written, but it works for me.

    #!/bin/sed -f
    s/á/a/g;
    s/é/e/g;
    s/í/i/g;
    s/ó/o/g;
    s/ú/u/g;
    s/ñ/n/g;
    