Let's imagine I have a UTF-8 encoded std::string containing the following:
óó
and I'd like to convert it to the following:
ÓÓ
There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8.
The linked article (utf8everywhere) and those answers apply to Windows. The C++ standard requires wchar_t to be wide enough to represent every character of the largest extended character set among the supported locales. On Linux and most Unix-like systems it is 32 bits wide, which is enough for any Unicode code point, and it works perfectly well alongside UTF-8. On Windows, wchar_t is 16 bits wide and holds UTF-16 code units, but if you're on Windows you have bigger problems than that, to be honest (namely its horrifying API).
It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.
Not really. Set the locale inside the code. Some programs, like sort, don't work properly unless the locale is set in the shell, for example, so there the onus is on the user.
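As a minimal sketch of "set the locale inside the code": std::setlocale returns a null pointer when the requested locale isn't available, so it's worth checking the result. The helper name use_ctype_locale is made up for illustration.

```cpp
#include <clocale>

// Hypothetical helper (name invented for this sketch): select `name` as the
// locale for character classification and conversion (LC_CTYPE).
// Returns false if the system does not provide that locale.
bool use_ctype_locale(const char* name)
{
    return std::setlocale(LC_CTYPE, name) != nullptr;
}
```

Calling use_ctype_locale("en_US.UTF-8") pins the locale used in this answer, assuming it is installed on the machine. Calling use_ctype_locale("") instead adopts whatever the user's environment specifies.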
I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string.
The code example uses iterators. If you don't want to convert every character, don't.
Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.
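You have undefined behavior. std::toupper takes an int whose value must be representable as unsigned char (or be EOF). The maximum value of an unsigned char is 255 on platforms with 8-bit bytes, and 0xc3b3 far exceeds that. Those are really two separate UTF-8 bytes, 0xc3 and 0xb3, that together encode one character.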
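To make the point concrete, here is a rough sketch of how the two bytes 0xc3 0xb3 combine into the single code point U+00F3 (ó). The helper decode_two_byte is hypothetical and only handles 2-byte sequences; it is not a full UTF-8 decoder.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode the 2-byte UTF-8 sequence starting at s[pos].
// Returns the code point, or U+FFFD (replacement character) if the bytes at
// that position are not a well-formed 2-byte sequence.
char32_t decode_two_byte(const std::string& s, std::size_t pos)
{
    if (pos + 1 >= s.size())
        return 0xFFFD;

    const unsigned char lead  = static_cast<unsigned char>(s[pos]);
    const unsigned char trail = static_cast<unsigned char>(s[pos + 1]);

    // A 2-byte sequence looks like 110xxxxx 10xxxxxx.
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80)
        return 0xFFFD;

    // Five payload bits from the lead byte, six from the trail byte.
    const char32_t cp = static_cast<char32_t>(((lead & 0x1F) << 6) | (trail & 0x3F));

    // Values below 0x80 must be encoded in one byte (reject overlong forms).
    return cp < 0x80 ? 0xFFFD : cp;
}
```

For the string "\xC3\xB3", decode_two_byte(s, 0) yields 0xF3. That value is what a case-conversion function needs to see, not the raw 16-bit number 0xc3b3.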
I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!
This example works perfectly fine:
#include <clocale>
#include <cwctype>
#include <iostream>
#include <string>

int main()
{
    // Use the UTF-8 enabled English locale for character handling.
    std::setlocale(LC_CTYPE, "en_US.UTF-8");
    std::wstring str = L"óó";
    std::wcout << str << std::endl;
    // Upper-case one wide character at a time.
    for (std::wstring::iterator it = str.begin(); it != str.end(); ++it)
        *it = std::towupper(*it);
    std::wcout << str << std::endl;
}
Outputs óó followed by ÓÓ.
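If the data has to stay in a UTF-8 std::string, one possible approach is to convert to wide characters, upper-case, and convert back using the C library's std::mbstowcs and std::wcstombs. This is only a sketch: the name to_upper_mb is made up, it relies on whatever locale is currently set, and it stops at the first embedded null byte.

```cpp
#include <clocale>
#include <cstdlib>
#include <cwctype>
#include <stdexcept>
#include <string>

// Hypothetical helper: upper-case a multibyte string according to the
// current C locale (set it with std::setlocale before calling).
std::string to_upper_mb(const std::string& in)
{
    // A string never decodes to more wide characters than it has bytes.
    std::wstring wide(in.size() + 1, L'\0');
    const std::size_t wlen = std::mbstowcs(&wide[0], in.c_str(), wide.size());
    if (wlen == static_cast<std::size_t>(-1))
        throw std::runtime_error("input is not valid in the current locale");
    wide.resize(wlen);

    for (wchar_t& ch : wide)
        ch = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(ch)));

    // Each wide character needs at most MB_CUR_MAX bytes, plus a terminator.
    std::string out(wide.size() * MB_CUR_MAX + 1, '\0');
    const std::size_t len = std::wcstombs(&out[0], wide.c_str(), out.size());
    if (len == static_cast<std::size_t>(-1))
        throw std::runtime_error("result cannot be encoded in the current locale");
    out.resize(len);
    return out;
}
```

After std::setlocale(LC_CTYPE, "en_US.UTF-8") succeeds, to_upper_mb("óó") should give "ÓÓ" on systems where that locale is installed. In the default "C" locale, non-ASCII input will typically make the function throw.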