I am reading about the charater set and encodings on Windows. I noticed that there are two compiler flags in Visual Studio compiler (for C++) called MBCS and UNICODE. What i
As a footnote to the other answers, MSDN has a document Generic-Text Mappings in TCHAR.H with handy tables summarizing how the preprocessor directives _UNICODE and _MBCS change the definition of different C/C++ types.
As to the phrasing "Unicode" and "Multi-Byte Character Set", people have already described what the effects are. I just want to emphasize that both of those are Microsoft-speak for some very specific things. (That is, they mean something less general and more particular-to-Windows than one might expect if coming from a non-Microsoft-specific understanding of text internationalization.) Those exact phrases show up and tend to get their own separate sections/subsections of microsoft technical documents, e.g. in Text and Strings in Visual C++
_MBCS and _UNICODE are macros to determine which version of TCHAR.H routines to call. For example, if you use _tcsclen
to count the length of a string, the preprocessor would map _tcsclen
to different version according to the two macros: _MBCS and _UNICODE.
_UNICODE & _MBCS Not Defined: strlen
_MBCS Defined: _mbslen
_UNICODE Defined: wcslen
To explain the difference of these string length counting functions, consider following example.
If you have a computer box that run Windows Simplified Chinese edition which use GBK(936 code page), you compile a gbk-file-encoded source file and run it.
printf("%d\n", _mbslen((const unsigned char*)"I爱你M"));
printf("%d\n", strlen("I爱你M"));
printf("%d\n", wcslen((const wchar_t*)"I爱你M"));
The result would be 4 6 3
.
Here is the hexdecimal representation of I爱你M
in GBK.
GBK: 49 B0 AE C4 E3 4D 00
_mbslen knows this string is encoded in GBK, so it could intepreter the string correctly and get the right result 4
words: 49
as I
, B0 AE
as 爱
, C4 E3
as 你
, 4D
as M
.
strlen only knows 0x00
, so it get 6
.
wcslen consider this hexdeciaml array is encoded in UTF16LE, and it count two bytes as one word, so it get 3
words: 49 B0
, AE C4
, E3 4D
.
as @xiaokaoy pointed out, the only valid terminator for wcslen
is 00 00
. Thus the result is not guranteed to be 3
if the following byte is not 00
.
MBCS means Multi-Byte Character Set and describes any character set where a character is encoded into (possibly) more than 1 byte.
The ANSI / ASCII character sets are not multi-byte.
UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).
However, UTF-8 is only one out of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and happens to be the encoding used by Windows / .NET (IIRC). Here's the difference between UTF-8 and UTF-16:
UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters as 2 bytes, and some as 4 bytes.
It is therefore not correct that Unicode is a 16-bit character encoding. It's rather something like a 21-bit encoding (or even more these days), as it encompasses a character set with code points U+000000
up to U+10FFFF
.
I noticed that there are two compiler flags in Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them ?
Many functions in the Windows API come in two versions: One that takes char
parameters (in a locale-specific code page) and one that takes wchar_t
parameters (in UTF-16).
int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);
Each of these function pairs also has a macro without the suffix, that depends on whether the UNICODE
macro is defined.
#ifdef UNICODE
#define MessageBox MessageBoxW
#else
#define MessageBox MessageBoxA
#endif
In order to make this work, the TCHAR
type is defined to abstract away the character type used by the API functions.
#ifdef UNICODE
typedef wchar_t TCHAR;
#else
typedef char TCHAR;
#endif
This, however, was a bad idea. You should always explicitly specify the character type.
What I am not getting is how UTF-8 is conceptually different from a MBCS encoding ?
MBCS stands for "multi-byte character set". For the literal minded, it seems that UTF-8 would qualify.
But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.
To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar
, call the "W" version of the function, and call WideCharToMultiByte
on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.
Unicode is a 16-bit character encoding
This negates whatever I read about the Unicode.
MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.