I have a problem with a string in C++ which has several words in Spanish. This means that I have a lot of words with accents and tildes. I want to replace them for their not
/// <summary>
///
/// Replace any accent and foreign character by their ASCII equivalent.
/// In other words, convert a string to an ASCII-complient string.
///
/// This also get rid of special hidden character, like EOF, NUL, TAB and other '\0', except \n\r
///
/// Tests with accents and foreign characters:
/// Before: "äæǽaeöœoeüueÄAeÜUeÖOeÀÁÂÃÄÅǺĀĂĄǍΑΆẢẠẦẪẨẬẰẮẴẲẶАAàáâãåǻāăąǎªαάảạầấẫẩậằắẵẳặаaБBбbÇĆĈĊČCçćĉċčcДDдdÐĎĐΔDjðďđδdjÈÉÊËĒĔĖĘĚΕΈẼẺẸỀẾỄỂỆЕЭEèéêëēĕėęěέεẽẻẹềếễểệеэeФFфfĜĞĠĢΓГҐGĝğġģγгґgĤĦHĥħhÌÍÎÏĨĪĬǏĮİΗΉΊΙΪỈỊИЫIìíîïĩīĭǐįıηήίιϊỉịиыїiĴJĵjĶΚКKķκкkĹĻĽĿŁΛЛLĺļľŀłλлlМMмmÑŃŅŇΝНNñńņňʼnνнnÒÓÔÕŌŎǑŐƠØǾΟΌΩΏỎỌỒỐỖỔỘỜỚỠỞỢОOòóôõōŏǒőơøǿºοόωώỏọồốỗổộờớỡởợоoПPпpŔŖŘΡРRŕŗřρрrŚŜŞȘŠΣСSśŝşșšſσςсsȚŢŤŦτТTțţťŧтtÙÚÛŨŪŬŮŰŲƯǓǕǗǙǛŨỦỤỪỨỮỬỰУUùúûũūŭůűųưǔǖǘǚǜυύϋủụừứữửựуuÝŸŶΥΎΫỲỸỶỴЙYýÿŷỳỹỷỵйyВVвvŴWŵwŹŻŽΖЗZźżžζзzÆǼAEßssIJIJijijŒOEƒf'ξksπpβvμmψpsЁYoёyoЄYeєyeЇYiЖZhжzhХKhхkhЦTsцtsЧChчchШShшshЩShchщshchЪъЬьЮYuюyuЯYaяya"
/// After: "aaeooeuueAAeUUeOOeAAAAAAAAAAAAAAAAAAAAAAAaaaaaaaaaaaaaaaaaaaaaaaBbCCCCCCccccccDdDDjddjEEEEEEEEEEEEEEEEEEeeeeeeeeeeeeeeeeeeFfGGGGGgggggHHhhIIIIIIIIIIIIIiiiiiiiiiiiiJJjjKKkkLLLLllllMmNNNNNnnnnnOOOOOOOOOOOOOOOOOOOOOOooooooooooooooooooooooPpRRRRrrrrSSSSSSssssssTTTTttttUUUUUUUUUUUUUUUUUUUUUUUUuuuuuuuuuuuuuuuuuuuuuuuYYYYYYYYyyyyyyyyVvWWwwZZZZzzzzAEssIJijOEf'kspvmpsYoyoYeyeYiZhzhKhkhTstsChchShshShchshchYuyuYaya"
///
/// Tests with invalid 'special hidden characters':
/// Before: "\0\0\000\0000Bj��rk�\'\"\\\0\a\b\f\n\r\t\v\u0020���oacu\'\\\'te�"
/// After: "00000Bjrk'\"\\\n\r oacu'\\'te"
///
/// </summary>
private string Normalize(string StringToClean)
{
string normalizedString = StringToClean.Normalize(NormalizationForm.FormD);
StringBuilder Buffer = new StringBuilder(StringToClean.Length);
for (int i = 0; i < normalizedString.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(normalizedString[i]) != UnicodeCategory.NonSpacingMark)
{
Buffer.Append(normalizedString[i]);
}
}
string PreAsciiCompliant = Buffer.ToString().Normalize(NormalizationForm.FormC);
StringBuilder AsciiComplient = new StringBuilder(PreAsciiCompliant.Length);
foreach (char character in PreAsciiCompliant)
{
//Reject all special characters except \n\r (Carriage-Return and Line-Feed).
//Get rid of special hidden character, like EOF, NUL, TAB and other '\0'
if (((int)character >= 32 && (int)character < 127) || ((int)character == 10 || (int)character == 13))
{
AsciiComplient.Append(character);
}
}
return AsciiComplient.ToString().Trim(); // Remove spaces at start and end of string if any
}
I disagree with the currently "approved" answer. The question makes perfect sense when you are indexing text. Like case-insensitive search, accent-insensitive search is a good idea. "naïve" matches "Naïve" matches "naive" matches "NAİVE" (you do know that an uppercase i is İ in Turkish? That's why you ignore accents)
Now, the best algorithm is hinted at the approved answer: Use NKD (decomposition) to decompose accented letters into the base letter and a seperate accent, and then remove all accents.
There is little point in the re-composition afterwards, though. You removed most sequences which would change, and the others are for all intents and purposes identical anyway. WHat's the difference between æ in NKC and æ in NKD?
I could not link the ICU libraries but I still think it's the best solution. As I need this program to be functional as soon as possible I made a little program (that I have to improve) and I'm going to use that. Thank you all for for suggestions and answers.
Here's the code I'm gonna use:
for (it= dictionary.begin(); it != dictionary.end(); it++)
{
strMine=(it->first);
found=toReplace.find(strMine);
while (found != std::string::npos)
{
strAux=(it->second);
toReplace.erase(found,2);
toReplace.insert(found,strAux);
found=toReplace.find(strMine,found+1);
}
}
I will change it next time I have to turn my program in for correction (in about 6 weeks).
You might want to check out the boost (http://www.boost.org/) library.
It has a regexp library, which you could use. In addition it has a specific library that has some functions for string manipulation (link) including replace.
Try using std::wstring instead of std::string. UTF-16 should work (as opposed to ASCII).
If you can (if you're running Unix), I suggest using the tr facility for this: it's custom-built for this purpose. Remember, no code == no buggy code. :-)
Edit: Sorry, you're right, tr
doesn't seem to work. How about sed
? It's a pretty stupid script I've written, but it works for me.
#!/bin/sed -f
s/á/a/g;
s/é/e/g;
s/í/i/g;
s/ó/o/g;
s/ú/u/g;
s/ñ/n/g;