Any good solutions for C++ string code point and code unit?

青春惊慌失措 2020-12-03 16:50

In Java, a String has methods:

length()/charAt(), codePointCount()/codePointAt()

C++11 has std::string a = u8"很烫烫的一锅汤";

1 Answer
  • 2020-12-03 17:10

    I generally convert the UTF-8 string to a wide (UTF-32 or UCS-2) string before doing character operations. C++ does provide functions for this, but they are not very user-friendly, so I have written some nicer conversion functions:

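    #include <codecvt>   // std::codecvt_utf8
    #include <iostream>
    #include <locale>    // std::wstring_convert
    #include <stdexcept>
    #include <string>
    
    // These convert between UTF-8 and whatever the platform's wide character
    // encoding is (UTF-32 on Linux, UCS-2 on Windows)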
    std::string ws_to_utf8(std::wstring const& s)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
        std::string utf8 = cnv.to_bytes(s);
        if(cnv.converted() < s.size())
            throw std::runtime_error("incomplete conversion");
        return utf8;
    }
    
    std::wstring utf8_to_ws(std::string const& utf8)
    {
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
        std::wstring s = cnv.from_bytes(utf8);
        if(cnv.converted() < utf8.size())
            throw std::runtime_error("incomplete conversion");
        return s;
    }
    
    int main()
    {
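        // C++11/14/17 only: in C++20 a u8 literal is char8_t-based and
        // no longer initializes a std::string
        std::string s = u8"很烫烫的一锅汤";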
    
        auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)
    
        // now we can use code-point indexes on the wide string
    
        std::cout << s << " is " << w.size() << " characters long" << '\n';
    }
    

    Output:

    很烫烫的一锅汤 is 7 characters long
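
    To make the code unit versus code point distinction from the question concrete, here is a minimal sketch that counts code points directly on the UTF-8 bytes, without converting first. The helper name utf8_code_point_count is hypothetical, and the sketch assumes the input is already valid UTF-8 (it performs no validation):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, roughly analogous to Java's codePointCount().
// Assumes `utf8` is already valid UTF-8; no validation is done here.
std::size_t utf8_code_point_count(std::string const& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    assert(count <= utf8.size()); // never more code points than code units
    return count;
}
```

    For the string above, size() on the std::string reports 21 code units (each of the seven characters takes three bytes in UTF-8), while this helper reports 7.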
    

    If you want to convert to and from UTF-32 regardless of platform, you can use the following (not thoroughly tested) conversion routines:

    std::string utf32_to_utf8(std::u32string const& utf32)
    {
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
        std::string utf8 = cnv.to_bytes(utf32);
        if(cnv.converted() < utf32.size())
            throw std::runtime_error("incomplete conversion");
        return utf8;
    }
    
    std::u32string utf8_to_utf32(std::string const& utf8)
    {
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
        std::u32string utf32 = cnv.from_bytes(utf8);
        if(cnv.converted() < utf8.size())
            throw std::runtime_error("incomplete conversion");
        return utf32;
    }
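
    Once the text is in a std::u32string, every element is one code point, so size() and indexing work per code point on every platform. A small sketch of what that allows follows; both helper names are hypothetical and not part of any library:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: the code point at a given code-point index.
// std::u32string::at() throws std::out_of_range for a bad index.
char32_t code_point_at(std::u32string const& utf32, std::size_t index)
{
    return utf32.at(index);
}

// Hypothetical helper: format a code point in the usual U+XXXX notation.
std::string format_code_point(char32_t cp)
{
    assert(cp <= 0x10FFFF); // largest valid Unicode code point
    char buf[16];
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(cp));
    return buf;
}
```

    Note that this indexes by code point, whereas Java's codePointAt() takes an index into the UTF-16 char sequence, so the two are not interchangeable for text outside the Basic Multilingual Plane.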
    

    NOTE: As of C++17 std::wstring_convert is deprecated.

    However, I still prefer it over a third-party library: it is portable, it avoids external dependencies, it won't be removed until a replacement is provided, and in any case it will be easy to replace the implementations of these functions without changing the code that uses them.
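
    As an illustration of that last point, here is one possible replacement for utf8_to_utf32 that does not use std::wstring_convert. It is a hand-written sketch, not a tested library routine; the name utf8_to_utf32_manual is hypothetical, and it assumes that rejecting malformed input with an exception is the desired behavior:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical replacement for utf8_to_utf32 without std::wstring_convert.
// Throws std::runtime_error on malformed UTF-8.
std::u32string utf8_to_utf32_manual(std::string const& utf8)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (utf8.size() - i < len)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point");

        out.push_back(cp);
        i += len;
    }
    assert(i == utf8.size()); // every byte was consumed exactly once
    return out;
}
```

    Because it has the same signature as utf8_to_utf32, callers would not need to change if the body of the original function were swapped for something along these lines.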
