Convert between std::u8string and std::string

攒了一身酷 2021-02-12 10:01

C++20 added char8_t and std::u8string for UTF-8. However, there is no UTF-8 version of std::cout and OS APIs mostly expect char

2 answers
  • 2021-02-12 10:51

    At present, std::c8rtomb and std::mbrtoc8 are the only interfaces provided by the standard that enable conversion between the execution encoding and UTF-8. The interfaces are awkward. They were designed to match pre-existing interfaces like std::c16rtomb and std::mbrtoc16. The wording added to the C++ standard for these new interfaces intentionally matches the wording in the C standard for the pre-existing related functions (hopefully these new functions will eventually be added to C; I still need to pursue that). The intent in matching the C standard wording, confusing as it is, is to ensure that anyone familiar with the C wording recognizes that the char8_t interfaces work the same way.

    cppreference.com has some examples for the UTF-16 versions of these functions that should be useful for understanding the char8_t variants.

    • https://en.cppreference.com/w/cpp/string/multibyte/mbrtoc16
    • https://en.cppreference.com/w/cpp/string/multibyte/c16rtomb
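
    As a rough illustration of the calling pattern those pages describe, here is a minimal sketch that decodes a narrow string into UTF-16 with std::mbrtoc16. It uses the UTF-16 function rather than std::mbrtoc8 because library support for the char8_t variants still varies. The helper name narrow_to_utf16 is hypothetical, and the input is assumed to be in the encoding of the current C locale (as set with std::setlocale) and to contain no embedded null characters.

    ```cpp
    #include <cstddef>
    #include <cuchar>
    #include <string>

    // Hypothetical helper (not a standard function): decode a narrow multibyte
    // string into UTF-16 code units, one std::mbrtoc16 call at a time.
    // Assumption: `in` is encoded in the current C locale's encoding.
    std::u16string narrow_to_utf16(const std::string& in) {
        std::u16string out;
        std::mbstate_t state{};                  // conversion state, initially empty
        const char* p = in.c_str();
        const char* end = p + in.size() + 1;     // include the terminating '\0'
        char16_t c16 = 0;
        // A return value of 0 means the null character was decoded: stop there.
        while (std::size_t rc = std::mbrtoc16(&c16, p,
                                              static_cast<std::size_t>(end - p),
                                              &state)) {
            if (rc == static_cast<std::size_t>(-1) ||   // invalid byte sequence
                rc == static_cast<std::size_t>(-2)) {   // incomplete sequence
                break;
            }
            out.push_back(c16);
            // (size_t)-3 means a second surrogate was stored and no input was
            // consumed; otherwise rc is the number of bytes consumed.
            if (rc != static_cast<std::size_t>(-3)) {
                p += rc;
            }
        }
        return out;
    }
    ```

    According to the answer above, std::mbrtoc8 is meant to follow the same protocol with char8_t in place of char16_t, so the loop shape should carry over where that function is available.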
  • 2021-02-12 10:58

    UTF-8 "support" in C++20 seems to be a bad joke.

    The only UTF functionality in the STL is support for strings and string_views (std::u8string, std::u8string_view, std::u16string, ...). That is all. There is no STL support for UTF encodings in regular expressions, formatting, file I/O, and so on.

    In C++17 you can--at least--easily treat any UTF-8 data as 'char' data, which makes usage of std::regex, std::fstream, std::cout, etc. possible without loss of performance.

    In C++20 things will change. You can no longer write, for example, std::string text = u8"..."; It will be impossible to write something like

    std::u8fstream file; std::u8string line; ... file << line;
    

    since there is no std::u8fstream.
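
    One possible workaround, sketched here under the assumption that the char-based stream should simply receive the UTF-8 bytes unchanged, is to hand the code units to std::ostream::write. The helper name write_utf8 is hypothetical. It does not copy the string, but it also does no validation or transcoding.

    ```cpp
    #include <ostream>
    #include <string_view>

    // Hypothetical helper (not part of the standard library): write the raw
    // UTF-8 code units of `s` to a char-based stream. Viewing char8_t storage
    // through const char* is allowed because char may alias any object type.
    std::ostream& write_utf8(std::ostream& os, std::u8string_view s) {
        return os.write(reinterpret_cast<const char*>(s.data()),
                        static_cast<std::streamsize>(s.size()));
    }
    ```

    With this, write_utf8(file, line) would work for a std::ofstream file and a std::u8string line. Whether the bytes display correctly on a terminal still depends on the environment.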

    Even the new C++20 std::format does not support UTF at all, because all necessary overloads are simply missing. You cannot write

    std::u8string text = std::format(u8"...{}...", 42);
    

    To make matters worse, there is no simple casting (or conversion) between std::string and std::u8string (or even between const char* and const char8_t*). So if you want to format (using std::format) or input/output (std::cin, std::cout, std::fstream, ...) UTF-8 data, you have to copy all strings internally, which is an unnecessary performance killer.
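
    The copying described above can be sketched as follows. The helper names utf8_to_string and string_to_utf8 are hypothetical. Both functions copy the bytes as they are and perform no validation, so string_to_utf8 assumes its input already holds valid UTF-8.

    ```cpp
    #include <string>
    #include <string_view>

    // Hypothetical helpers (names are illustrative, not standard).

    // Copy UTF-8 code units into a std::string. Reading char8_t storage
    // through const char* is permitted because char may alias any object type.
    std::string utf8_to_string(std::u8string_view s) {
        return std::string(reinterpret_cast<const char*>(s.data()), s.size());
    }

    // Copy bytes that are assumed to be UTF-8 into a std::u8string. The
    // iterator-range constructor converts each char to char8_t one by one,
    // which avoids casting a char pointer to a char8_t pointer.
    std::u8string string_to_utf8(std::string_view s) {
        return std::u8string(s.begin(), s.end());
    }
    ```

    Each call allocates and copies once, which is the cost this answer objects to. The sketch only shows that the conversion itself is short to write.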

    Finally, what use will UTF have without input, output, and formatting?
