C++ tolower on special characters such as ü

耶瑟儿~ 2020-12-19 13:25

I'm having trouble converting a string to lowercase with the tolower() function in C++. It works as expected for plain ASCII strings, but special characters such as ü are not converted.

3 answers
  • 2020-12-19 14:05

    The sample code below, taken from the documentation for tolower, shows how to fix this: you have to use a locale other than the default "C" locale.

    #include <iostream>
    #include <cctype>
    #include <clocale>
    
    int main()
    {
        unsigned char c = '\xb4'; // the character Ž in ISO-8859-15
                                  // but ´ (acute accent) in ISO-8859-1 
    
        std::setlocale(LC_ALL, "en_US.iso88591");
        std::cout << std::hex << std::showbase;
        std::cout << "in iso8859-1, tolower('0xb4') gives "
                  << std::tolower(c) << '\n';
        std::setlocale(LC_ALL, "en_US.iso885915");
        std::cout << "in iso8859-15, tolower('0xb4') gives "
                  << std::tolower(c) << '\n';
    }
    

    You might also switch from std::string to std::wstring, which holds Unicode on many C++ implementations. Note that towlower still depends on the current locale.

    std::wstring NotLowerCase = L"Grüßen";
    std::wstring LowerCase;
    for (auto&& ch : NotLowerCase) {
        LowerCase += std::towlower(ch);
    }
    

    Guidance from Microsoft is to "Normalize strings to uppercase", so you might use toupper or towupper instead.

    Keep in mind that a character-by-character transformation does not work well for some languages. For example, in German as written in Germany, upper-casing Grüßen gives GRÜSSEN: the single ß becomes two characters (although there is now a capital ẞ). There are numerous other problems, such as combining characters; if you're doing real production work with strings, you really want a completely different approach.
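
    To make the ß case concrete, here is a minimal sketch of a string-level (rather than character-level) conversion. The helper name to_upper_expanding_eszett is hypothetical, and the code assumes wchar_t holds Unicode code points, so that ß is U+00DF. It handles only this one expansion; it is not a general case-mapping solution.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper (not part of any library): upper-cases a wide string,
// expanding the German sharp s (U+00DF, assumed encoding) to "SS".
std::wstring to_upper_expanding_eszett(const std::wstring& in)
{
    std::wstring out;
    out.reserve(in.size());
    for (wchar_t ch : in) {
        if (ch == L'\u00DF') {
            out += L"SS";  // one input character becomes two output characters
        } else {
            // All other characters go through the locale-dependent towupper.
            out += static_cast<wchar_t>(
                std::towupper(static_cast<std::wint_t>(ch)));
        }
    }
    return out;
}
```

    The result can be longer than the input, which is exactly what a loop of the form `out[i] = towupper(in[i])` cannot express. Whether letters such as ü are converted still depends on the current locale.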

    Finally, C++ has more sophisticated support for managing locales; see <locale> for details.
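
    As a rough sketch of what the <locale> route can look like, the following uses the std::ctype<wchar_t> facet of an explicitly passed locale instead of the global C locale. The helper name to_lower_with is made up for this example; which non-ASCII letters actually get mapped depends on the locale and on the implementation.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: lower-cases a wide string using the
// std::ctype<wchar_t> facet of the locale passed in.
std::wstring to_lower_with(const std::locale& loc, std::wstring text)
{
    if (text.empty()) {
        return text;
    }
    const auto& facet = std::use_facet<std::ctype<wchar_t>>(loc);
    // The range overload of tolower converts [first, last) in place.
    facet.tolower(&text[0], &text[0] + text.size());
    return text;
}
```

    Calling it with std::locale::classic() converts only ASCII letters; passing std::locale("") would use whatever locale the user's environment selects, assuming that locale is installed.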

  • 2020-12-19 14:05

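    Use ASCII only: replace every character outside the range 'A' to 'z' with '?' and lowercase the rest.

    std::string NotLowerCase = "Grüßen";
    std::string LowerCase;
    for (std::size_t i = 0; i < NotLowerCase.length(); i++) {
        unsigned char ch = static_cast<unsigned char>(NotLowerCase[i]);
        if (ch < 'A' || ch > 'z') {
            // outside 'A'..'z'; this includes every non-ASCII byte
            LowerCase += '?';
        } else {
            LowerCase += static_cast<char>(tolower(ch));
        }
    }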
    
  • 2020-12-19 14:18

    I think the most portable way to do this is to use the user-selected locale, which you get by setting the locale to "" (the empty string).

    std::locale::global(std::locale("")); 
    

    That sets the locale to whatever is in use in the environment where the program runs, and it affects the standard character conversion routines (std::mbsrtowcs and std::wcsrtombs) that convert between multi-byte and wide-character strings.

    Then you can use those functions to convert from the system/user selected multi-byte characters (such as UTF-8) to system standard wide character codes that can be used in functions like std::tolower that operate on one character at a time.

    This is important because text in a multi-byte encoding such as UTF-8 cannot be converted with single-character operations such as std::tolower().
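
    To illustrate why, here is a minimal sketch, assuming the string is UTF-8 and the program is still in the default "C" locale. The helper name bytewise_lower is made up for this example. In UTF-8, Ü is the two bytes 0xC3 0x9C, and neither byte on its own is a letter that tolower knows how to map.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: applies std::tolower to each byte of a narrow string.
std::string bytewise_lower(std::string s)
{
    for (char& c : s) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behavior.
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}
```

    In the "C" locale, calling bytewise_lower on the UTF-8 bytes of "GRÜ" lowercases G and R but leaves the two bytes of Ü untouched.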

    Once you have converted the wide string version to upper/lower case it can then be converted back to the system/user multibyte character set for printing to the console.

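    #include <cstdlib>
    #include <cwchar>
    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>
    
    // Convert from multi-byte codes to wide string codes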
    std::wstring mb_to_ws(std::string const& mb)
    {
        std::wstring ws;
        std::mbstate_t ps{};
        char const* src = mb.data();
    
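        // with a null destination the length argument is ignored;
        // this call only measures the required size
        std::size_t len = 1 + mbsrtowcs(0, &src, 0, &ps);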
    
        ws.resize(len);
        src = mb.data();
    
        mbsrtowcs(&ws[0], &src, ws.size(), &ps);
    
        if(src)
            throw std::runtime_error("invalid multibyte character after: '"
                + std::string(mb.data(), src) + "'");
    
        ws.pop_back();
    
        return ws;
    }
    
    // Convert from wide string codes to multi-byte codes
    std::string ws_to_mb(std::wstring const& ws)
    {
        std::string mb;
        std::mbstate_t ps{};
        wchar_t const* src = ws.data();
    
        std::size_t len = 1 + wcsrtombs(0, &src, 0, &ps);
    
        mb.resize(len);
        src = ws.data();
    
        wcsrtombs(&mb[0], &src, mb.size(), &ps);
    
        if(src)
            throw std::runtime_error("invalid wide character");
    
        mb.pop_back();
    
        return mb;
    }
    
    int main()
    {
        // set locale to the one chosen by the user
        // (or the one set by the system default)
        std::locale::global(std::locale(""));
    
        try
        {
            std::string NotLowerCase = "Grüßen";
    
            std::cout << NotLowerCase << '\n';
    
            // convert system/user multibyte character codes
            // to wide string versions
            std::wstring ws1 = mb_to_ws(NotLowerCase);
            std::wstring ws2;
    
            // use the system/user locale, constructed once
            std::locale const loc("");
            for(wchar_t ch : ws1) {
                ws2 += std::tolower(ch, loc);
            }
    
            // convert wide string character codes back
            // to system/user multibyte versions
            std::string LowerCase = ws_to_mb(ws2);
    
            std::cout << LowerCase << '\n';
        }
        catch(std::exception const& e)
        {
            std::cerr << e.what() << '\n';
            return EXIT_FAILURE;
        }
        catch(...)
        {
            std::cerr << "Unknown exception." << '\n';
            return EXIT_FAILURE;
        }
    
        return EXIT_SUCCESS;
    }
    

    Note: this code has not been heavily tested.
