I have trouble converting a string to lowercase with the tolower() function in C++. With normal strings it works as expected; however, special characters are not converted.
The sample code (below) from tolower shows how you fix this; you have to use something other than the default "C" locale.
#include <iostream>
#include <cctype>
#include <clocale>
int main()
{
unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
// but ´ (acute accent) in ISO-8859-1
std::setlocale(LC_ALL, "en_US.iso88591");
std::cout << std::hex << std::showbase;
std::cout << "in iso8859-1, tolower('0xb4') gives "
<< std::tolower(c) << '\n';
std::setlocale(LC_ALL, "en_US.iso885915");
std::cout << "in iso8859-15, tolower('0xb4') gives "
<< std::tolower(c) << '\n';
}
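Two details the sample does not show: std::setlocale returns a null pointer when the requested locale is not installed, and std::tolower has undefined behavior when given a negative char value. A minimal sketch that guards against both, assuming a single-byte encoding (lower_in_locale is a hypothetical helper name, not part of the sample):

```cpp
#include <cctype>
#include <clocale>
#include <string>

// Lowercase a single-byte-encoded string in place using the C library.
// `locale_name` is whatever name your system uses (e.g. "en_US.iso88591");
// returns false and leaves `s` untouched if that locale is not available.
bool lower_in_locale(std::string& s, const char* locale_name)
{
    if (std::setlocale(LC_CTYPE, locale_name) == nullptr)
        return false;  // locale not installed on this system

    for (char& ch : s) {
        // Cast to unsigned char first: tolower() must not receive
        // negative values (other than EOF).
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return true;
}
```

Note that this changes the process-wide C locale as a side effect, just like the sample above.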
You might also change std::string to std::wstring, which is Unicode on many C++ implementations.
wstring NotLowerCase = L"Grüßen";
wstring LowerCase;
for (auto&& ch : NotLowerCase) {
LowerCase += towlower(ch);
}
Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.
Keep in mind that a character-by-character transformation might not work well for some languages. For example, using German as spoken in Germany, making Grüßen all upper-case turns it into GRÜSSEN (although there is now a capital ẞ). There are numerous other "problems" such as combining characters; if you're doing real "production" work with strings, you really want a completely different approach.
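As a rough illustration of why a one-to-one mapping falls short, the sketch below special-cases ß, which expands to two letters when upper-cased. upper_de is a hypothetical helper; it handles only this single case and relies on std::towupper for everything else, so treat it as a demonstration of the problem and not as a general solution.

```cpp
#include <cwctype>
#include <string>

// Uppercase a wide string, expanding U+00DF (ß) to "SS".
// Hypothetical helper for illustration only; real case mapping has many
// more special cases than this one.
std::wstring upper_de(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size() + 1);
    for (wchar_t ch : in) {
        if (ch == L'\u00DF') {
            out += L"SS";  // one input character, two output characters
        } else {
            out += static_cast<wchar_t>(
                std::towupper(static_cast<std::wint_t>(ch)));
        }
    }
    return out;
}
```

The output can be longer than the input, which is exactly what a loop of the form "result += toupper(ch)" cannot express.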
Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
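One way the <locale> facilities can be used, as a minimal sketch: fetch the std::ctype<wchar_t> facet from a std::locale object and let it lowercase a whole range. to_lower_ws is a hypothetical helper name, and the default argument (the classic "C" locale) is only there so the function works without any locale installed; pass std::locale("") or a named locale to get language-specific behavior.

```cpp
#include <locale>
#include <string>

// Lowercase a wide string through the ctype facet of `loc`.
// Does not touch the global locale.
std::wstring to_lower_ws(std::wstring ws,
                         const std::locale& loc = std::locale::classic())
{
    if (ws.empty())
        return ws;

    const std::ctype<wchar_t>& facet =
        std::use_facet<std::ctype<wchar_t>>(loc);

    // ctype::tolower(low, high) converts the range [low, high) in place.
    facet.tolower(&ws[0], &ws[0] + ws.size());
    return ws;
}
```

Which non-ASCII characters get converted depends on the locale you pass in and on what the platform provides for it.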
Use plain ASCII and replace every character outside the range 65 to 122 ('A' to 'z') with '?':
string NotLowerCase = "Grüßen";
string LowerCase = "";
for (unsigned int i = 0; i < NotLowerCase.length(); i++) {
if(NotLowerCase[i]<65||NotLowerCase[i]>122)
{
LowerCase+='?';
}
else
LowerCase += tolower(NotLowerCase[i]);
}
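If replacing every non-ASCII byte with '?' is too lossy, a variant sketch (not the code above, just one possible alternative) is to lowercase only A to Z and pass every other byte through unchanged. It is locale-independent and leaves UTF-8 sequences intact, although accented letters are of course not converted. ascii_lower is a hypothetical name.

```cpp
#include <string>

// Lowercase only the ASCII letters A-Z; every other byte is copied as is,
// so multi-byte UTF-8 sequences are not damaged.
std::string ascii_lower(std::string s)
{
    for (char& ch : s) {
        if (ch >= 'A' && ch <= 'Z')
            ch = static_cast<char>(ch - 'A' + 'a');
    }
    return s;
}
```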
I think the most portable way to do this is to use the user-selected locale, which is achieved by setting the locale to "" (the empty string).
std::locale::global(std::locale(""));
That sets the locale to whatever was in use where the program was run, and it affects the standard character conversion routines (std::mbsrtowcs and std::wcsrtombs) that convert between multi-byte and wide-character strings.
Then you can use those functions to convert from the system/user-selected multi-byte characters (such as UTF-8) to system-standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.
This is important because multi-byte character sets like UTF-8 cannot be converted using single-character operations such as std::tolower().
Once you have converted the wide string version to upper/lower case, it can be converted back to the system/user multi-byte character set for printing to the console.
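#include <cstdlib>
#include <cwchar>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>
// Convert from multi-byte codes to wide string codes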
std::wstring mb_to_ws(std::string const& mb)
{
std::wstring ws;
std::mbstate_t ps{};
char const* src = mb.data();
// with a null destination the length argument is ignored; this call only counts
std::size_t len = 1 + mbsrtowcs(0, &src, 0, &ps);
ws.resize(len);
src = mb.data();
mbsrtowcs(&ws[0], &src, ws.size(), &ps);
if(src)
throw std::runtime_error("invalid multibyte character after: '"
+ std::string(mb.data(), src) + "'");
ws.pop_back();
return ws;
}
// Convert from wide string codes to multi-byte codes
std::string ws_to_mb(std::wstring const& ws)
{
std::string mb;
std::mbstate_t ps{};
wchar_t const* src = ws.data();
std::size_t len = 1 + wcsrtombs(0, &src, 0, &ps);
mb.resize(len);
src = ws.data();
wcsrtombs(&mb[0], &src, mb.size(), &ps);
if(src)
throw std::runtime_error("invalid wide character");
mb.pop_back();
return mb;
}
int main()
{
// set locale to the one chosen by the user
// (or the one set by the system default)
std::locale::global(std::locale(""));
try
{
std::string NotLowerCase = "Grüßen";
std::cout << NotLowerCase << '\n';
// convert system/user multibyte character codes
// to wide string versions
std::wstring ws1 = mb_to_ws(NotLowerCase);
std::wstring ws2;
for(unsigned int i = 0; i < ws1.length(); i++) {
// use the system/user locale
ws2 += std::tolower(ws1[i], std::locale(""));
}
// convert wide string character codes back
// to system/user multibyte versions
std::string LowerCase = ws_to_mb(ws2);
std::cout << LowerCase << '\n';
}
catch(std::exception const& e)
{
std::cerr << e.what() << '\n';
return EXIT_FAILURE;
}
catch(...)
{
std::cerr << "Unknown exception." << '\n';
return EXIT_FAILURE;
}
return EXIT_SUCCESS;
}
Note: this code has not been heavily tested.