assuming that I have
uint32_t a(3084);
I would like to create a string that stores the unicode character U+3084
which means that I
The C++ standard contains the std::codecvt<char32_t, char, mbstate_t>
facet which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...>
facets aren't easy to use. At some point there was discussion about filtering stream buffers which would take case of the code conversion (the standard C++ library needs to implement them anyway for std::basic_filebuf<...>
) but I can't see any trace of these.
std::string_convert::to_bytes has a single-char overload just for you.
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>
// utility function for output
void hex_print(const std::string& s)
{
std::cout << std::hex << std::setfill('0');
for(unsigned char c : s)
std::cout << std::setw(2) << static_cast<int>(c) << ' ';
std::cout << std::dec << '\n';
}
int main()
{
uint32_t a(3084);
std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
std::string u8str = conv1.to_bytes(a);
std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
hex_print(u8str);
}
I get (with libc++)
$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c
auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;
prints
や
for me (using g++ 4.8.1). s
has type const char*
, as you'd expect, but I don't know if this is implementation defined. Unfortunately C++ doesn't have any support for manipulation of UTF8 strings are far as I know; for that you need to use a library like Glib::ustring
.
Here's some C++ code that wouldn't be hard to convert to C. Adapted from an older answer.
std::string UnicodeToUTF8(unsigned int codepoint)
{
std::string out;
if (codepoint <= 0x7f)
out.append(1, static_cast<char>(codepoint));
else if (codepoint <= 0x7ff)
{
out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
}
else if (codepoint <= 0xffff)
{
out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
}
else
{
out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
}
return out;
}