Is it possible to convert a UTF-8 string held in a std::string to a std::wstring and vice versa in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte, but those calls are Windows-only.
The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.
Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.
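To make the bit layout concrete, here is a minimal sketch (the main function and the Euro-sign example are purely illustrative, not part of the conversion code below) that reassembles the code point U+20AC from its three UTF-8 bytes:

#include <cassert>
#include <string>

int main()
{
    // U+20AC (Euro sign) = 0010 0000 1010 1100
    // UTF-8:  0xE2 0x82 0xAC  (1110xxxx 10xxxxxx 10xxxxxx)
    // UTF-16: 0x20AC          (below 0x10000, so no surrogate pair needed)
    // UTF-32: 0x000020AC
    const std::string utf8 = { '\xE2', '\x82', '\xAC' };
    const unsigned int codepoint = ((utf8[0] & 0x0Fu) << 12) |
                                   ((utf8[1] & 0x3Fu) << 6) |
                                    (utf8[2] & 0x3Fu);
    assert(codepoint == 0x20AC);
    return 0;
}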
Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.
#include <string>

std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (; *in != 0; ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
        {
            // High surrogate: save the top bits and wait for the low surrogate.
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        }
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                codepoint |= *in - 0xdc00;  // low surrogate completes the pair
            else
                codepoint = *in;            // BMP character or 32-bit code point

            // Emit the code point as one to four UTF-8 bytes.
            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}
The above code works for both UTF-16 and UTF-32 input, simply because the range 0xd800 through 0xdfff is reserved for surrogates and is never valid as a code point on its own; seeing those values tells you that you're decoding UTF-16. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.
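If you want to convince yourself of the surrogate arithmetic, here is a small sketch (U+1F600 is just an arbitrary example above U+FFFF) that splits a supplementary code point into a UTF-16 pair and recombines it the same way wchar_to_UTF8 does:

#include <cassert>

int main()
{
    const unsigned int cp = 0x1F600;                              // any code point above 0xffff
    const unsigned int high = 0xd800 + ((cp - 0x10000) >> 10);    // 0xd83d
    const unsigned int low  = 0xdc00 + ((cp - 0x10000) & 0x3ff);  // 0xde00
    assert(high == 0xd83d && low == 0xde00);
    // Recombine exactly as wchar_to_UTF8 does with a surrogate pair:
    const unsigned int rebuilt = ((high - 0xd800) << 10) + 0x10000 + (low - 0xdc00);
    assert(rebuilt == cp);
    return 0;
}

Going the other way, from UTF-8 to wchar_t: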
std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // single byte (ASCII)
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // leading byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // leading byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // leading byte of a 4-byte sequence
        ++in;
        // Emit once the next byte is not a continuation byte.
        if (((static_cast<unsigned char>(*in) & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                // 16-bit wchar_t: encode as a UTF-16 surrogate pair.
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference: the expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead branch and remove it.
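For completeness, a round-trip check might look like the sketch below; it assumes the two functions above are in scope and that the string literal really is valid UTF-8 (the bytes are spelled out with hex escapes):

#include <cassert>
#include <string>

int main()
{
    const std::string original = "caf\xC3\xA9 \xE2\x82\xAC";      // "café €" encoded as UTF-8
    const std::wstring wide = UTF8_to_wchar(original.c_str());
    const std::string back = wchar_to_UTF8(wide.c_str());
    assert(back == original);                                     // lossless round trip
    return 0;
}

If C++17 is available, the sizeof(wchar_t) > 2 test could also be written with if constexpr, which makes the compile-time selection explicit instead of relying on the optimizer to drop the dead branch.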