How to uppercase/lowercase UTF-8 characters in C++?

野性不改 2021-02-19 09:59

Let's imagine I have a UTF-8 encoded std::string containing the following:

óó

and I'd like to convert it to the following:

ÓÓ

4 answers

情书的邮戳 (original poster)

2021-02-19 10:07
> There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8.

The linked article (utf8everywhere) and those answers apply to Windows. The C++ standard requires wchar_t to be wide enough to represent every character of the largest extended character set among the supported locales. On Linux that means 32 bits, and it works perfectly fine alongside UTF-8. On Windows, wchar_t is 16 bits and holds UTF-16, but if you're on Windows you have more problems than just that, if we're going to be honest (namely its horrifying API).
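
Since the width of wchar_t differs between platforms, it can help to check it on the target system. The following is a minimal sketch; the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration and are not part of any library.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t
#include <limits>    // std::numeric_limits

// Hypothetical helper: number of bits in wchar_t on this platform.
constexpr std::size_t wchar_bits()
{
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when a single wchar_t can hold any
// Unicode code point (the highest one is U+10FFFF).
constexpr bool wchar_holds_any_code_point()
{
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```

On a typical Linux system `wchar_bits()` is 32 and the second helper returns true. On Windows `wchar_bits()` is 16 and it returns false, because characters outside the Basic Multilingual Plane need two UTF-16 code units there.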

> It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

Not really. Set the locale inside the code. Some programs, such as sort, don't work properly unless the locale is set in the shell, so in those cases the onus is on the user.
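
Setting the locale in code can fail if the named locale is not installed, so it is worth checking the return value of std::setlocale. Here is a small sketch of that check. The helper names are hypothetical, and the "C.UTF-8" fallback is an assumption that holds on many, but not all, systems.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: try to select the LC_CTYPE locale by name.
// std::setlocale returns a null pointer when the request cannot be honored.
bool set_ctype_locale(const std::string& name)
{
    return std::setlocale(LC_CTYPE, name.c_str()) != nullptr;
}

// Hypothetical helper: prefer the locale used in this answer, then fall
// back to "C.UTF-8" (assumed to exist; not every system provides it).
bool set_utf8_ctype_locale()
{
    return set_ctype_locale("en_US.UTF-8") || set_ctype_locale("C.UTF-8");
}
```

A program could call `set_utf8_ctype_locale()` once at startup and report an error if it returns false, rather than silently continuing in the "C" locale.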

> I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string.

The code example below uses iterators, so it converts exactly the characters it visits. If you don't want to convert every character, don't.
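
To illustrate converting only part of a string, here is a minimal sketch that upper-cases an iterator range. The function name `to_upper_range` is made up for this example, and the result for non-ASCII characters depends on the active LC_CTYPE locale.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: upper-case the wide characters in [first, last) in place.
void to_upper_range(std::wstring::iterator first, std::wstring::iterator last)
{
    for (; first != last; ++first)
        *first = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(*first)));
}
```

For instance, passing `s.begin()` and `s.begin() + 5` for `L"hello world"` upper-cases only the first word and leaves the rest untouched.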

> Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.

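You have undefined behavior. std::toupper requires its argument to be representable as unsigned char (or to equal EOF), and the maximum value of unsigned char is typically 255. 0xc3b3 (50099 in decimal) far exceeds that.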
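
The value 0xc3b3 is really the two UTF-8 bytes 0xC3 and 0xB3 that encode ó, not one character code. The sketch below shows both points: a range check for std::toupper's argument and a byte dump of a string. Both helper names are hypothetical.

```cpp
#include <climits>  // UCHAR_MAX
#include <cstdio>   // EOF, std::snprintf
#include <string>

// Hypothetical helper: true if c is a valid argument for std::toupper,
// i.e. representable as unsigned char or equal to EOF.
bool valid_toupper_argument(long c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

// Hypothetical helper: the bytes of s as space-separated lowercase hex.
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s) {
        if (!out.empty())
            out += ' ';
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        out += buf;
    }
    return out;
}
```

Here `valid_toupper_argument(0xc3b3)` is false, and `hex_bytes("\xC3\xB3")` gives `"c3 b3"`: a single ó occupies two chars in a UTF-8 std::string, so a byte-wise std::toupper cannot map it to Ó.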

> I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!

This example works perfectly fine:
```
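#include <clocale>   // std::setlocale
#include <cwctype>   // std::towupper
#include <iostream>  // std::wcout
#include <string>    // std::wstring

int main()
{
    // Select the UTF-8 enabled English locale for character handling.
    std::setlocale(LC_CTYPE, "en_US.UTF-8");

    std::wstring str = L"óó";

    std::wcout << str << std::endl;

    // Upper-case each wide character in place.
    for (std::wstring::iterator it = str.begin(); it != str.end(); ++it)
        *it = std::towupper(*it);

    std::wcout << str << std::endl;
}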
```
Output: óó on the first line, then ÓÓ on the second.
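
The question starts from a UTF-8 std::string rather than a std::wstring, so the same idea needs a conversion step on each side. The following is a sketch under assumptions, not a definitive implementation: it relies on the current LC_CTYPE locale being a UTF-8 one, it stops at the first embedded null byte, and the name `to_upper_multibyte` is made up for this example.

```cpp
#include <cstddef>
#include <cstdlib>   // std::mbstowcs, std::wcstombs, MB_CUR_MAX
#include <cwctype>   // std::towupper
#include <optional>
#include <string>

// Hypothetical helper: upper-case a multibyte string that is encoded in the
// current LC_CTYPE locale. Returns std::nullopt if a conversion fails.
std::optional<std::string> to_upper_multibyte(const std::string& in)
{
    const std::size_t failed = static_cast<std::size_t>(-1);

    // Each wide character consumes at least one byte, so in.size() is an upper bound.
    std::wstring wide(in.size(), L'\0');
    std::size_t wlen = std::mbstowcs(&wide[0], in.c_str(), wide.size());
    if (wlen == failed)
        return std::nullopt;
    wide.resize(wlen);

    for (wchar_t& wc : wide)
        wc = static_cast<wchar_t>(std::towupper(static_cast<std::wint_t>(wc)));

    // Each wide character produces at most MB_CUR_MAX bytes.
    std::string out(wide.size() * MB_CUR_MAX, '\0');
    std::size_t blen = std::wcstombs(&out[0], wide.c_str(), out.size());
    if (blen == failed)
        return std::nullopt;
    out.resize(blen);
    return out;
}
```

With a UTF-8 locale selected beforehand (for example via std::setlocale as in the answer's code), passing the UTF-8 bytes of óó should give the bytes of ÓÓ, in line with the output above. On platforms where wchar_t is 16 bits, characters outside the Basic Multilingual Plane may not survive this round trip.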