wchar_t* with UTF8 chars in MSVC

Question

I am trying to format a wchar_t* string containing non-ASCII characters into a char buffer using vsnprintf, and then print that buffer as UTF-8 using printf.

Given the following code:

/*
  This code is modified version of KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm

  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/

#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>

#ifdef _WIN32 /* predefined by the compiler; MSVC is not, so the guarded code was never compiled */
#include <windows.h>
#endif

void vout(char *string, char *fmt, ...)
{
   setlocale(LC_CTYPE, "en_US.UTF-8");
   va_list arg_ptr;

   va_start(arg_ptr, fmt);
   vsnprintf(string, 100, fmt, arg_ptr);
   va_end(arg_ptr);
}

int main(void)
{
   setlocale(LC_ALL, "");
#ifdef _WIN32
   SetConsoleOutputCP(65001); // with or without; no dice
#endif

   char string[100];

   wchar_t arr[] = { 0x0119, L'\0' }; /* %ls needs a null-terminated string */
   vout(string, "%ls", arr);
   printf("This string should have 'ę' (e with ogonek / tail) after colon:  %s\n", string);
   return 0;
}

I compiled with gcc v5.4 on Ubuntu 16 to get the desired output in BASH:

gcc test.c -o test_vsn
./test_vsn
This string should have 'ę' (e with ogonek / tail) after colon:  ę

However, on Windows 10 with CL v19.10.25019 (VS 2017), I get weird output in CMD:

cl test.c /Fetest_vsn /utf-8
.\test_vsn
This string should have 'T' (e with ogonek / tail) after colon:  e

(the ę before colon becomes T and after the colon is e without ogonek)

Note that I used CL's new /utf-8 switch (introduced in VS 2015), but the output is the same with or without it. According to their blog post:

There is also a /utf-8 option that is a synonym for setting “/source-charset:utf-8” and “/execution-charset:utf-8”.

(My source file is already saved as UTF-8 with a BOM, and the execution-charset setting apparently does not help.)

What is the minimal set of changes to the code or compiler switches that would make the output identical to that of gcc?

Answer 1:

Based on @RemyLebeau's comment, I modified the code to use the wide (w) variants of the printf APIs, so that the output with MSVC on Windows matches that of gcc on Unix.

Additionally, instead of changing the console code page, I now use _setmode to change the translation mode of the output stream.

/*
  This code is modified version of KB sample:
  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_73/rtref/vsnprintf.htm

  The usage of `setlocale` is required by my real-world scenario,
  but can be modified if that fixes the issue.
*/

#include <wchar.h>
#include <stdarg.h>
#include <stdio.h>
#include <locale.h>

#ifdef _WIN32
#include <io.h> //for _setmode
#include <fcntl.h> //for _O_U16TEXT
#endif

void vout(wchar_t *string, wchar_t *fmt, ...)
{
   setlocale(LC_CTYPE, "en_US.UTF-8");
   va_list arg_ptr;

   va_start(arg_ptr, fmt);
   vswprintf(string, 100, fmt, arg_ptr);
   va_end(arg_ptr);
}

int main(void)
{
   setlocale(LC_ALL, "");
#ifdef _WIN32
   int oldmode = _setmode(_fileno(stdout), _O_U16TEXT);
#endif

   wchar_t string[100];

   wchar_t arr[] = { 0x0119, L'\0' };
   vout(string, L"%ls", arr);
   wprintf(L"This string should have 'ę' (e with ogonek / tail) after colon:  %ls\r\n", string);

#ifdef _WIN32
   fflush(stdout); /* flush pending wide output before switching modes */
   _setmode(_fileno(stdout), oldmode);
#endif
   return 0;
}
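
If the narrow char* interface of the original vout has to be kept, one possible alternative (not part of the answer above) is to convert the wide string to UTF-8 explicitly instead of relying on the locale-dependent %ls conversion. The following is only a minimal sketch under that assumption; utf8_encode and wide_to_utf8 are hypothetical helper names, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   Writes up to 4 bytes to out and returns the byte count (0 if invalid). */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* unpaired surrogate */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert a null-terminated wide string to UTF-8.
   Surrogate pairs are combined, which matters where wchar_t is 16 bits
   wide (as on Windows). Returns the number of bytes written, excluding
   the terminator, or (size_t)-1 on invalid input or insufficient space. */
static size_t wide_to_utf8(const wchar_t *src, char *dst, size_t dst_size)
{
    size_t used = 0;

    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return (size_t)-1;

    while (*src != L'\0') {
        unsigned long cp = (unsigned long)*src++;
        char tmp[4];
        size_t n, i;

        if (cp >= 0xD800 && cp <= 0xDBFF && *src >= 0xDC00 && *src <= 0xDFFF)
            cp = 0x10000 + ((cp - 0xD800) << 10)
                         + ((unsigned long)*src++ - 0xDC00);

        n = utf8_encode(cp, tmp);
        if (n == 0 || used + n >= dst_size)
            return (size_t)-1; /* always keep room for the terminator */
        for (i = 0; i < n; ++i)
            dst[used++] = tmp[i];
    }
    dst[used] = '\0';
    return used;
}
```

A caller could convert arr with wide_to_utf8(arr, string, sizeof string) and then print string with printf("%s"). On Windows the bytes would presumably display correctly only when the console code page is UTF-8 (65001), which is what the SetConsoleOutputCP call in the question sets.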

Alternatively, we can use fwprintf and pass stdout as the first argument. To do the same for stderr with fwprintf(stderr, format, args), we would need to call _setmode on stderr as well. (Note that perror takes only a single message string, not a format and arguments.)
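
The stderr case could be sketched roughly as follows. The helper names enable_wide_output and report_error are hypothetical and chosen only for illustration. The _setmode call is guarded by _WIN32, so on other platforms the first helper does nothing.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <wchar.h>

#ifdef _WIN32
#include <fcntl.h> /* for _O_U16TEXT */
#include <io.h>    /* for _setmode, _fileno */
#endif

/* Hypothetical helper: switch a stream to UTF-16 text mode on Windows.
   Returns the previous mode, or -1 on failure. On other platforms it
   is a no-op that returns 0. */
static int enable_wide_output(FILE *stream)
{
    assert(stream != NULL);
#ifdef _WIN32
    fflush(stream); /* flush pending output before changing the mode */
    return _setmode(_fileno(stream), _O_U16TEXT);
#else
    return 0;
#endif
}

/* Hypothetical helper: write a formatted wide message to stderr.
   Returns the number of wide characters written, or a negative value
   on error (the return value of vfwprintf). */
static int report_error(const wchar_t *fmt, ...)
{
    va_list args;
    int written;

    assert(fmt != NULL);
    va_start(args, fmt);
    written = vfwprintf(stderr, fmt, args);
    va_end(args);
    return written;
}
```

A program would call enable_wide_output(stderr) once at startup, next to the corresponding call for stdout, and afterwards use only wide functions on that stream, for example report_error(L"cannot open %ls\n", name).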

Source: https://stackoverflow.com/questions/45449346/wchar-t-with-utf8-chars-in-msvc

Tags

visual-c++

utf-8

wchar-t