Reading UTF-8 text and converting to UTF-16 using standard C++ wifstream

Posted by 大城市里の小女人 on 2019-12-04 07:36:51

This works on Windows with Visual Studio, I think as far back as VS2010:

#include <codecvt> // codecvt_utf8_utf16, consume_header
#include <fstream> // wifstream
#include <locale>  // locale

// src is the std::wifstream that reads the UTF-8 file.
src.imbue(std::locale(
    src.getloc(),
    new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));

Since Windows uses a 16-bit wchar_t and universally uses UTF-16 as the wide character encoding, this works well in that environment. (Because I'm assuming a Windows environment, my example includes consume_header to handle the Windows convention of prefixing UTF-8 data with a byte order mark.)
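To show the imbue call in context, here is a minimal sketch of reading a whole file this way. The helper name read_utf8_file_as_utf16 and the choice to open in binary mode are my own assumptions, not part of the snippet above. Also note that <codecvt> is deprecated as of C++17, so newer compilers may warn about it.

```cpp
#include <codecvt>  // std::codecvt_utf8_utf16, std::consume_header
#include <fstream>  // std::wifstream
#include <locale>   // std::locale
#include <sstream>  // std::wostringstream
#include <string>   // std::string, std::wstring

// Hypothetical helper (not from the original answer): reads a whole UTF-8
// file and returns its contents as UTF-16 code units stored in wchar_t.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file_as_utf16(const std::string& path) {
    std::wifstream src;
    // Imbue before opening so the facet is in place before any bytes are read.
    src.imbue(std::locale(
        src.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10FFFF, std::consume_header>));
    // Binary mode (an assumption): the bytes reach the facet untranslated.
    src.open(path, std::ios::binary);
    if (!src) {
        return std::wstring();
    }

    // Pull everything through the converting stream buffer.
    std::wostringstream contents;
    contents << src.rdbuf();
    return contents.str();
}
```

With consume_header in the facet, a leading byte order mark should be dropped rather than showing up as the first character of the result.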

On other platforms wchar_t is generally 32-bit and, while you can store UTF-16 code unit values in such 32-bit code units, no other code will expect that representation. On a platform with 32-bit wchar_t you might prefer to use std::codecvt_utf8<wchar_t> to produce UTF-32 wide strings.


For portability, what you'd ideally want is a codecvt facet that knows how to convert from UTF-8 to either the locale's wchar_t encoding or the wide execution encoding. The problem, however, is that there's no requirement for any wide encoding to support the entire range of characters representable in UTF-8. The bottom line is that wchar_t, as specified, isn't particularly useful for portable code.

However, one trick that might be useful if you're sticking to platforms that use UTF-16 or UTF-32 depending on the size of wchar_t is:

#include <climits> // CHAR_BIT

template <int N> struct get_codecvt_utf8_wchar_impl;
template <> struct get_codecvt_utf8_wchar_impl<16> {
  using type = std::codecvt_utf8_utf16<wchar_t>;
};
template <> struct get_codecvt_utf8_wchar_impl<32> {
  using type = std::codecvt_utf8<wchar_t>;
};

using codecvt_utf8_wchar = get_codecvt_utf8_wchar_impl<
    sizeof(wchar_t) * CHAR_BIT>::type;

src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));
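As a rough illustration of what the trick buys you, the sketch below repeats the selection so that it compiles on its own and wraps it in a reader function. The name read_utf8_file_as_wide is hypothetical, as is opening in binary mode. The point is that a character outside the BMP should come back as two wchar_t values where wchar_t is 16-bit and as one where it is 32-bit.

```cpp
#include <climits>  // CHAR_BIT
#include <codecvt>  // std::codecvt_utf8, std::codecvt_utf8_utf16
#include <fstream>  // std::wifstream
#include <locale>   // std::locale
#include <sstream>  // std::wostringstream
#include <string>   // std::string, std::wstring

// Same size-based selection as above, repeated to keep this self-contained.
template <int N> struct get_codecvt_utf8_wchar_impl;
template <> struct get_codecvt_utf8_wchar_impl<16> {
  using type = std::codecvt_utf8_utf16<wchar_t>;  // UTF-8 <-> UTF-16
};
template <> struct get_codecvt_utf8_wchar_impl<32> {
  using type = std::codecvt_utf8<wchar_t>;        // UTF-8 <-> UTF-32
};

using codecvt_utf8_wchar = get_codecvt_utf8_wchar_impl<
    sizeof(wchar_t) * CHAR_BIT>::type;

// Hypothetical helper: reads a UTF-8 file into whichever wide encoding
// matches the platform's wchar_t size. Returns an empty string on failure.
std::wstring read_utf8_file_as_wide(const std::string& path) {
  std::wifstream src;
  src.imbue(std::locale(src.getloc(), new codecvt_utf8_wchar));
  src.open(path, std::ios::binary);
  if (!src) {
    return std::wstring();
  }
  std::wostringstream contents;
  contents << src.rdbuf();
  return contents.str();
}
```

Unlike the Windows-only snippet, this facet is not given consume_header, so a leading byte order mark would come through as an ordinary U+FEFF character.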

You can also use char16_t and char32_t, which would lend themselves to portable code. However, the standard is missing a few pieces needed to make iostreams usable with these character types, and implementations don't fully support what is specified.

VS, I think, still implements char16_t and char32_t as typedefs, so the template specializations using them don't work (the specializations do exist if you look in the headers, but they're ifdef'd out because the compiler can't handle them). libstdc++ doesn't implement the template specializations yet, even though it supports char16_t and char32_t as real types. The most complete implementation I know of is libc++ with a suitable compiler (gcc or clang), but even that is still missing the <cuchar> header.

Since implementation support is limited, portable code can't do much with these types beyond using them as a consistent representation in user code across platforms (though that is useful in its own right).
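To give a feel for what "a consistent representation in user code" can look like, here is a small sketch that converts between the two types by hand, without relying on any library facet. The function name utf32_to_utf16 is hypothetical, and replacing invalid input with U+FFFD is my own choice rather than anything the standard mandates.

```cpp
#include <string>  // std::u16string, std::u32string

// Hypothetical helper: encodes UTF-32 code points as UTF-16 code units.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        // Surrogate values and anything past U+10FFFF are not valid scalar
        // values; substitute the replacement character (an assumption).
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;
        }
        if (cp < 0x10000) {
            // BMP code points map to a single code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Everything else becomes a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Because char16_t and char32_t have fixed meanings everywhere, code like this behaves the same regardless of how wide wchar_t happens to be.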
