How to uppercase/lowercase UTF-8 characters in C++?

野性不改 2021-02-19 09:59

Let's imagine I have a UTF-8 encoded std::string containing the following:

óó

and I'd like to convert it to the following:

ÓÓ

4 answers
  •  情书的邮戳
    2021-02-19 10:07

    There are some examples on StackOverflow but they use wide character strings, and other answers say you shouldn't be using wide character strings for UTF-8.

    The linked article (utf8everywhere) and those answers are about Windows. The C++ standard requires wchar_t to be wide enough to represent every character of the largest character set among the supported locales. On Linux that makes it 32 bits wide, so one wchar_t holds any Unicode code point, and it works perfectly fine alongside UTF-8 input and output. On Windows, wchar_t is 16 bits and holds UTF-16 code units, but if you're on Windows you have more problems than just that, if we're going to be honest (namely its horrifying API).

    It also appears that this problem can be very "tricky" in that the output might be dependent upon the user's locale.

    Not really. Set the locale inside the code. Some programs, such as sort, don't work properly unless the locale is set in the shell, for example, so in that case the onus is on the user.
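
    As a rough sketch of setting the locale in code, the helper below tries a few common UTF-8 locale names and reports which one the system accepted. The name set_utf8_ctype and the candidate list are assumptions made for illustration; which locales are installed varies from system to system.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: try a few common UTF-8 locale names for LC_CTYPE and
// return the first one the system accepts, or an empty string if none works.
std::string set_utf8_ctype()
{
    const char* candidates[] = {"en_US.UTF-8", "C.UTF-8", "en_US.utf8"};
    for (const char* name : candidates) {
        // std::setlocale returns a null pointer when the request cannot be honored.
        if (std::setlocale(LC_CTYPE, name) != nullptr)
            return name;
    }
    return "";
}
```

    Checking the return value matters because std::setlocale fails silently otherwise, and the program keeps running in the default "C" locale.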

    I was expecting to just use something like std::toupper(), but the usage is really unclear to me because it seems like I'm not just converting one character at a time but an entire string.

    The code example uses iterators. If you don't want to convert every character, don't.

    Also, this Ideone example I put together seems to show that toupper() of 0xc3b3 is just 0xc3b3, which is an unexpected result. Calling setlocale to either UTF-8 or ISO8859-1 doesn't appear to change the outcome.

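    You have undefined behavior. std::toupper takes an int whose value must be representable as unsigned char (0 to 255) or be equal to EOF, and 0xc3b3 is far outside that range. It is also not a single character: it is the two UTF-8 bytes 0xc3 0xb3 that together encode ó.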
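
    To make the byte-level picture concrete, here is a small sketch. The helper names utf8_bytes and naive_upper are made up for this illustration. The first exposes the individual bytes of a std::string, and the second applies std::toupper one byte at a time, which is what a naive loop over a UTF-8 std::string ends up doing.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Return the bytes of a std::string as unsigned values in the range 0..255.
std::vector<unsigned> utf8_bytes(const std::string& s)
{
    std::vector<unsigned> out;
    for (unsigned char c : s)
        out.push_back(c);
    return out;
}

// Apply std::toupper byte by byte. The cast to unsigned char keeps every
// argument inside the range std::toupper is defined for.
std::string naive_upper(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}
```

    In the default "C" locale, utf8_bytes("\xc3\xb3") yields the two values 0xc3 and 0xb3, and naive_upper("a\xc3\xb3") uppercases the a but leaves both bytes of ó untouched, because neither byte is a lowercase letter on its own.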

    I'd love some guidance if you could shed some light on either what I'm doing wrong or why my question/premise is faulty!

    This example works perfectly fine:

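    #include <clocale>   // std::setlocale
    #include <cwctype>   // std::towupper
    #include <iostream>  // std::wcout
    #include <string>    // std::wstring
    
    int main()
    {
        std::setlocale(LC_CTYPE, "en_US.UTF-8"); // the locale will be the UTF-8 enabled English
    
        std::wstring str = L"óó";
    
        std::wcout << str << std::endl;
    
        // Convert one wide character at a time
        for (std::wstring::iterator it = str.begin(); it != str.end(); ++it)
            *it = std::towupper(*it);
    
        std::wcout << str << std::endl;
    }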
    

    Outputs: ÓÓ
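
    Since the question starts from a UTF-8 std::string rather than a std::wstring, one possible way to bridge the two is to decode with std::mbstowcs, apply std::towupper, and encode back with std::wcstombs. The function name to_upper_utf8 is an assumption for this sketch. It also assumes that LC_CTYPE has already been set to a UTF-8 locale, that wchar_t can hold a full code point (true on Linux, not on Windows), and that the string contains no embedded NUL bytes.

```cpp
#include <clocale>
#include <cstdlib>
#include <cwctype>
#include <string>
#include <vector>

// Uppercase a multibyte string using the current LC_CTYPE locale.
// Returns the input unchanged if either conversion fails.
std::string to_upper_utf8(const std::string& in)
{
    // Decode: a UTF-8 string never has more code points than bytes.
    std::vector<wchar_t> wide(in.size() + 1);
    std::size_t n = std::mbstowcs(wide.data(), in.c_str(), wide.size());
    if (n == static_cast<std::size_t>(-1))
        return in; // invalid byte sequence for the current locale

    // Convert one wide character at a time.
    for (std::size_t i = 0; i < n; ++i)
        wide[i] = static_cast<wchar_t>(std::towupper(wide[i]));
    wide[n] = L'\0';

    // Encode: MB_CUR_MAX is the longest multibyte character in this locale.
    std::vector<char> narrow(n * MB_CUR_MAX + 1);
    std::size_t m = std::wcstombs(narrow.data(), wide.data(), narrow.size());
    if (m == static_cast<std::size_t>(-1))
        return in; // a character cannot be represented in this locale
    return std::string(narrow.data(), m);
}
```

    After a successful std::setlocale(LC_CTYPE, "en_US.UTF-8"), calling to_upper_utf8("óó") should give back the UTF-8 bytes for ÓÓ on a system where that locale is installed. This is still a per-code-point mapping, so it does not handle case conversions that change the number of characters.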
