I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string, encoded as UTF-8 bytes, and then the opposite way around.
Thanks everyone, but ultimately I resorted to http://utfcpp.sourceforge.net/ -- it's a header-only library that's very lightweight and easy to use. I'm sharing a demo code here, should anyone find it useful:
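#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // utfcpp

// Assumes a 32-bit wchar_t (e.g. Linux); with a 16-bit wchar_t (Windows),
// use utf8::utf8to16 / utf8::utf16to8 instead.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}

inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}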
Usage:
wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);
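If pulling in a dependency is not an option, the bit manipulation behind such a conversion can be sketched by hand. The following is only an illustration under stated assumptions, not utfcpp's implementation: the names encode_utf8_sketch and decode_utf8_sketch are made up for this example, and it works on std::u32string (C++11) rather than std::wstring so that it does not depend on the platform's wchar_t width.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of utfcpp): encode UTF-32 code points as UTF-8.
std::string encode_utf8_sketch(const std::u32string& code_points)
{
    std::string bytes;
    for (char32_t cp : code_points) {
        // Surrogates and values above U+10FFFF are not valid scalar values.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            bytes += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            bytes += static_cast<char>(0xC0 | (cp >> 6));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            bytes += static_cast<char>(0xE0 | (cp >> 12));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx + 3 continuation bytes
            bytes += static_cast<char>(0xF0 | (cp >> 18));
            bytes += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            bytes += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            bytes += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return bytes;
}

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Throws std::invalid_argument on malformed input.
std::u32string decode_utf8_sketch(const std::string& bytes)
{
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (bytes.size() - i < len)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min_for_len[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 sequence");
        out += cp;
        i += len;
    }
    return out;
}
```

Called as encode_utf8_sketch(U"\u05e9\u05dc\u05d5\u05dd"), this yields the eight bytes "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d", and decode_utf8_sketch reverses it. A tested library is still the safer choice for production code.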
There's already a Boost link in the comments, but in the almost-standard C++0x (now C++11) there is wstring_convert, which does this:
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d") << '\n'
              << (ws2 == uchars) << '\n';
}
Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:
true
true
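The same calls can be wrapped in two small helpers that mirror the question's "wstring to UTF-8 and back" task. This is just a sketch: to_utf8 and try_from_utf8 are names invented for this example, and the second one relies on from_bytes throwing std::range_error on malformed input when no error string was passed to the converter. Keep in mind that wstring_convert and <codecvt> were deprecated in C++17, so newer compilers may emit warnings.

```cpp
#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: wide string -> UTF-8 bytes.
inline std::string to_utf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);
}

// Hypothetical helper: UTF-8 bytes -> wide string.
// Returns false instead of throwing when the input is not valid UTF-8.
inline bool try_from_utf8(const std::string& bytes, std::wstring& out)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    try {
        out = conv.from_bytes(bytes);
        return true;
    } catch (const std::range_error&) {
        return false;
    }
}
```

A fresh converter is created per call so that no conversion state leaks between calls; for hot paths it may be worth reusing one instance instead.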
For a drop-in replacement for std::string/std::wstring that handles UTF-8, see TINYUTF8.
Combined with <codecvt>, you can convert between UTF-8 and pretty much any other encoding, and then handle the UTF-8 data through the above library.
Boost.Locale was released in Boost 1.48 (November 15th, 2011), making it easier to convert to and from UTF-8/UTF-16.
Here are some convenient examples from the docs:
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);
Almost as easy as Python encoding/decoding :)
Note that Boost.Locale is not a header-only library.