Are UTF16 (as used by for example wide-winapi functions) characters always 2 byte long?

前端 未结 8 1217
半阙折子戏
半阙折子戏 2021-02-09 06:23

Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (alway
8条回答
  •  春和景丽
    2021-02-09 07:01

    Short answer: No.

    The size of a wchar_t—the basic character unit—is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.

    In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it’s also valid for a wchar_t to contain big-endian data.

    Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.

    A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as “decomposed” code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.

    Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.

    On Linux/Mac UTF-8 (with chars) is more common and APIs usually take UTF-8. wchar_t is seen to be wasteful because it uses 4 bytes per character.

    For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on-the-fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.

    Update

    The question has been updated to ask what Windows APIs mean when they ask for the “number of characters” in a string.

    If the API says “size of the string in characters” they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.

提交回复
热议问题