C++20 added char8_t and std::u8string for UTF-8. However, there is no UTF-8 version of std::cout, and OS APIs mostly expect char.
At present, std::c8rtomb and std::mbrtoc8 are the only interfaces provided by the standard that enable conversion between the execution encoding and UTF-8. The interfaces are awkward. They were designed to match pre-existing interfaces like std::c16rtomb and std::mbrtoc16. The wording added to the C++ standard for these new interfaces intentionally matches the wording in the C standard for the pre-existing related functions (hopefully these new functions will eventually be added to C; I still need to pursue that). The intent in matching the C standard wording, as confusing as it is, is to ensure that anyone familiar with the C wording recognizes that the char8_t interfaces work the same way.
cppreference.com has some examples for the UTF-16 versions of these functions that should be useful for understanding the char8_t variants.
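To give a feel for how this family of functions is called, here is a minimal sketch using the long-standing UTF-16 variant, std::mbrtoc16, since library support for the char8_t versions still varies. The helper name to_utf16 is made up for illustration, and the input is assumed to be in the current locale's multibyte encoding; the loop simply stops on invalid or truncated input rather than reporting an error.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>
#include <string_view>

// Hypothetical helper (not part of the standard library): converts a string in
// the current locale's multibyte encoding to UTF-16 via std::mbrtoc16.
std::u16string to_utf16(std::string_view in) {
    std::u16string out;
    std::mbstate_t state{};          // conversion state carried between calls
    const char* p = in.data();
    const char* end = p + in.size();
    bool pending = false;            // true while a low surrogate is still owed

    while (p < end || pending) {
        char16_t c16 = 0;
        std::size_t rc =
            std::mbrtoc16(&c16, p, static_cast<std::size_t>(end - p), &state);

        if (rc == static_cast<std::size_t>(-1) ||
            rc == static_cast<std::size_t>(-2)) {
            break;                   // invalid or incomplete sequence: stop here
        }

        out.push_back(c16);
        pending = (c16 >= 0xD800 && c16 <= 0xDBFF);  // high surrogate just emitted

        if (rc == static_cast<std::size_t>(-3)) {
            continue;                // second half of a surrogate pair, no input consumed
        }
        p += (rc == 0 ? 1 : rc);     // rc == 0 means a null character was converted
    }
    return out;
}
```

The special return values (-1, -2, -3) and the explicit state object are what make these interfaces awkward; as the answer above says, the char8_t functions follow the same calling pattern, with std::mbrtoc8 writing char8_t code units instead.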
UTF-8 "support" in C++20 seems to be a bad joke.
The only UTF functionality in the STL is support for strings and string_views (std::u8string, std::u8string_view, std::u16string, ...). That is all. There is no STL support for UTF encodings in regular expressions, formatting, file I/O and so on.
In C++17 you can at least easily treat any UTF-8 data as 'char' data, which makes usage of std::regex, std::fstream, std::cout, etc. possible without loss of performance.
In C++20 things will change. You can no longer write, for example, std::string text = u8"..."; because u8 literals now have type const char8_t[N] rather than const char[N].
It will be impossible to write something like
std::u8fstream file; std::u8string line; ... file << line;
since there is no std::u8fstream.
Even the new C++20 std::format does not support UTF at all, because all necessary overloads are simply missing. You cannot write
std::u8string text = std::format(u8"...{}...", 42);
To make matters worse, there is no simple casting (or conversion) between std::string and std::u8string (or even between const char* and const char8_t*). So if you want to format (using std::format) or input/output (std::cin, std::cout, std::fstream, ...) UTF-8 data, you have to internally copy all strings. That is an unnecessary performance killer.
Finally, what use will UTF have without input, output, and formatting?