I'm trying to write Unicode strings to the screen in C++ on Windows. I changed my console font to Lucida Console and I set the output code page to CP_UTF8, but the C++ version still prints garbage, even though the same string prints fine from C.
Although you've set your console to expect UTF-8 output, I suspect that your compiler is treating string literals as being in some other character set. I don't know why the C compiler acts differently.
The good news is that C++11 includes some support for UTF-8, and that Microsoft has implemented the relevant portions of the Standard. The code is a little hairy, but you'll want to look into std::wstring_convert (converts to and from UTF-8) and the <cuchar> header.
You can use those functions to convert to UTF-8, and assuming your console is expecting UTF-8, things should work correctly.
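For what it's worth, here's a rough sketch of the <cuchar> route (not something I've tested on every toolchain; c32rtomb converts to the multibyte encoding of the current locale, so the output is only UTF-8 if the active locale actually uses UTF-8):

#include <climits>
#include <clocale>
#include <cstdio>
#include <cuchar>

int main()
{
    // Use the environment's locale; the result is UTF-8 only if that
    // locale really is a UTF-8 locale.
    std::setlocale(LC_ALL, "");

    const char32_t text[] = U"Россия";
    std::mbstate_t state{};
    char buf[MB_LEN_MAX];

    // Convert each code point to a multibyte sequence and write it out.
    for (const char32_t* p = text; *p != U'\0'; ++p)
    {
        std::size_t n = std::c32rtomb(buf, *p, &state);
        if (n != static_cast<std::size_t>(-1))
            std::fwrite(buf, 1, n, stdout);
    }
    std::fputc('\n', stdout);
}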
Personally, when I need to debug something like this, I often direct the output to a text file. Text editors seem to handle Unicode better than the Windows console. In my case, the code points often come out correctly, but the console is set up incorrectly, so I still end up printing garbage.
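Something like this is enough for that check (a minimal sketch; the file name is arbitrary, and it assumes C++11, where u8"…" is an ordinary char array):

#include <fstream>

int main()
{
    // Write the UTF-8 bytes straight to a file, then open it as UTF-8
    // in a text editor to verify the bytes themselves are correct.
    std::ofstream out("utf8_dump.txt", std::ios::binary);
    out << u8"Россия" << '\n';
}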
I can tell you that this worked for me on both Linux (using Clang) and Windows (using GCC 4.7.3 and Clang 3.5; you need to add -std=c++11 to the command line to compile with GCC or Clang):
#include <cstdio>

int main()
{
    // The u8 prefix guarantees the literal is stored as UTF-8 bytes.
    const char text[] = u8"Россия";
    std::printf("%s\n", text);
}
Using Visual C++ (2012, but I believe it would also work with 2010), I had to use:
#include <codecvt>
#include <cstdio>
#include <locale>
#include <string>

int main()
{
    // Convert the wide string literal to a UTF-8 encoded std::string.
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    auto text = converter.to_bytes(L"Россия");
    std::printf("%s\n", text.c_str());
}
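In both cases the console also has to be using the UTF-8 code page. You say you've already set CP_UTF8, so this is just for completeness, but done programmatically it looks something like the sketch below (Windows-only, and the console font still needs glyphs for the characters):

#include <windows.h>
#include <cstdio>

int main()
{
    // Tell the console to interpret this program's output as UTF-8.
    SetConsoleOutputCP(CP_UTF8);

    // "Россия" spelled out as UTF-8 bytes, so it works even on compilers
    // without u8 literal support.
    const char text[] = "\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd1\x8f";
    std::printf("%s\n", text);
}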
It's more surprising that the C implementation does work here than that the C++ one doesn't. A char can hold only one byte (numerical values 0-255), so the console should show only ASCII characters. C must be doing some magic for you here: it guesses that the bytes outside the ASCII range (0-127) that you're providing form a Unicode (probably UTF-8) multi-byte character. C++ just displays each byte of your const char[] array, and since UTF-8 bytes treated separately don't have distinct glyphs in your font, it prints those � replacement characters. Note that you assign 6 letters and get 12 question marks.
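You can see those 12 bytes for yourself with a small test (a sketch, assuming a compiler with u8 literal support):

#include <cstdio>

int main()
{
    const char text[] = u8"Россия";

    // Dump each byte of the array: 6 Cyrillic letters become 12 UTF-8 bytes.
    for (const char* p = text; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
}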
You can read about UTF-8 and ASCII encoding if you want, but the point is that std::wstring and std::wcout really are the solution designed to handle larger-than-byte characters. (If you're not using Latin characters at all, you don't even save memory by using char-based solutions such as const char[] and std::string instead of std::wstring: all those Cyrillic codes have to take up some space anyway.)
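A rough sketch of that wide-character route on Windows follows; the _setmode/_O_U16TEXT switch is Microsoft-specific and is my addition rather than part of the suggestion above, but it's usually what makes std::wcout produce readable output in the console:

#include <fcntl.h>    // _O_U16TEXT
#include <io.h>       // _setmode
#include <cstdio>     // _fileno
#include <iostream>
#include <string>

int main()
{
    // Switch stdout to UTF-16 mode so wide characters reach the console
    // unmangled (Microsoft-specific; only do wide output after this call).
    _setmode(_fileno(stdout), _O_U16TEXT);

    std::wstring text = L"Россия";
    std::wcout << text << L'\n';
}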
If your source file is encoded as UTF-8, you'll find the string length is 12. Run strlen from <string.h> (<cstring>) on it to see what I mean. Setting the output code page will print the bytes exactly as you see them.
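For instance (a quick check; it assumes the literal's bytes reach the array unchanged, i.e. a UTF-8 source file with no transcoding by the compiler):

#include <cstdio>
#include <cstring>

int main()
{
    // Assumes the source file is UTF-8 and the compiler passes the
    // literal's bytes through untouched, as described above.
    const char text[] = "Россия";
    std::printf("%lu\n", static_cast<unsigned long>(std::strlen(text)));  // 12 under that assumption
}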
What the compiler sees is equivalent to the following:
const char text[] = "\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd1\x8f";
Wrap it in a wide string (wchar_t in particular), and things aren't so nice.
Why does C++ handle it differently? I haven't the slightest clue, except perhaps that the mechanism underlying the C++ version is somewhat ignorant (e.g. std::cout happily outputs whatever you give it, blindly). Whatever the cause, apparently sticking to C is safest... which is actually unexpected to me, considering that Microsoft's own C compiler can't even compile C99 code.
In any case, I'd advise against outputting to the Windows console if possible, Unicode or not. Files are so much more reliable, not to mention less of a hassle.