Please clarify for me, how does UTF16 work? I am a little confused, considering these points:
Short story: UTF-16 is a variable-length encoding. A single character may be one or two widechars long.
HOWEVER, you may very well get away with treating it as a fixed-length encoding where every character is one widechar (2 bytes). This is formally called UCS-2, and it was Win32's assumption up to and including Windows NT 4; Windows 2000 and later treat wide strings as UTF-16. The UCS-2 repertoire (the Basic Multilingual Plane) covers nearly all scripts in everyday modern use. And truth be told, working with variable-length encoded strings just sucks. Finding the n-th character becomes an O(n) operation, string length is not the same as string size, etc. Any sensible parsing becomes a pain.
As for the UTF-16 chars that are not in UCS-2... I only know of two subsets that may theoretically come up in real life. The first is emoji, the graphical smileys that are popular in Japanese cell phone culture. On iPhone, there's a bunch of third-party apps that enable input of those. Outside of mobile phones, they often don't display properly. The other character class is VERY obscure Chinese characters, the ones even most Chinese speakers don't know. All the popular Chinese characters are well inside UCS-2.
According to the Unicode FAQ, it could be "one or two 16-bit code units".
Windows uses 16-bit chars, probably because Unicode was originally 16-bit. So you don't have an exact map, but you might be able to get away with treating all strings you see as containing only 16-bit Unicode characters.
You seem to have several misconceptions.
There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long obviously)
This is wrong. Assuming you refer to the C++ type wchar_t, it is not always 2 bytes long: 4 bytes is also a common value, and there's no restriction that it can be only those two values. If you don't refer to that, it isn't in C++ but is some platform-specific type.
There are no "extra wide" functions or characters types widely used in C++ or windows, so I would assume that UTF16 is all that is ever needed.
UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.
UTF-8 and UTF-16 are different encodings for the same character set, so UTF-16 is not "bigger". Technically, the scheme used in UTF-8 could encode more characters than the scheme used in UTF-16, but as UTF-8 and UTF-16 they encode the same set.
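To make the "same character set, different encodings" point concrete, here is a minimal sketch that reports how many code units each encoding needs for a given code point. The function names `utf8_length` and `utf16_length` are made up for this illustration, and the code assumes its input is a valid code point in the range U+0000..U+10FFFF.

```cpp
#include <cstdint>

// Number of bytes (8-bit code units) UTF-8 needs for one code point.
// Assumption: cp is a valid code point (<= 0x10FFFF, not a surrogate).
inline int utf8_length(std::uint32_t cp) {
    if (cp < 0x80) return 1;      // ASCII
    if (cp < 0x800) return 2;     // e.g. U+00E9
    if (cp < 0x10000) return 3;   // rest of the Basic Multilingual Plane
    return 4;                     // supplementary planes
}

// Number of 16-bit code units UTF-16 needs for the same code point.
inline int utf16_length(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;  // 2 means a surrogate pair
}
```

For example, U+20AC (the euro sign) takes 3 bytes in UTF-8 but a single 16-bit unit in UTF-16, while U+1F600 takes 4 bytes in either encoding (4 UTF-8 bytes, or 2 UTF-16 units).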
Don't use the term "character" lightly when it comes to Unicode. A code unit in UTF-16 is 2 bytes wide, and a code point is represented by 1 or 2 code units. What humans usually understand as "characters" is different and can be composed of one or more code points. If you as a programmer confuse code points with characters, bad things can happen, like http://ideone.com/qV2il
Windows' WCHAR is 16 bits (2 bytes) long.
A Unicode codepoint may be represented by one or two of these WCHARs: 16 or 32 bits (2 or 4 bytes).
wcslen returns the number of WCHAR units in a wide string, while wcslen_l returns the number of (locale-dependent) codepoints. Since every codepoint takes at least one WCHAR, wcslen >= wcslen_l.
A Unicode character may consist of multiple combining codepoints.
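The difference between counting 16-bit units and counting codepoints can be sketched portably with `char16_t` instead of WCHAR. The helper `count_code_points` below is a hypothetical name, not a CRT or Win32 function, and it assumes that an unpaired surrogate should simply count as one codepoint.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string: a high surrogate immediately
// followed by a low surrogate is counted once.
// Assumption: an unpaired surrogate is counted as one code point.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate; the pair is one code point
        }
        ++count;
    }
    return count;
}
```

With the string `u"a\U0001F600b"`, `size()` reports 4 code units while `count_code_points` reports 3. Note that this still does not count user-perceived characters, because combining sequences are not merged.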
All characters in the Basic Multilingual Plane will be 2 bytes long.
Characters in other planes will be encoded into 4 bytes each, in the form of a surrogate pair.
Obviously, if a function does not try to detect surrogate pairs and blindly treats each pair of bytes as a character, it will bug out on strings that contain such pairs.
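The surrogate pair arithmetic itself is short. The following is a sketch rather than a validating encoder: `encode_utf16` and `decode_surrogate_pair` are illustrative names, and both assume their inputs are well-formed (a valid code point, or a genuine high/low surrogate pair).

```cpp
#include <cstdint>
#include <string>

// Encodes one code point as UTF-16.
// Assumption: cp is a valid code point (<= 0x10FFFF, not a surrogate).
inline std::u16string encode_utf16(std::uint32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));  // BMP: one unit
    } else {
        const std::uint32_t v = cp - 0x10000;      // 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Combines a high and a low surrogate back into a code point.
// Assumption: high is in D800..DBFF and low is in DC00..DFFF.
inline std::uint32_t decode_surrogate_pair(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For U+1F600 this yields the units 0xD83D 0xDE00. A function that treats each of those two units as its own character will split the pair, which is exactly the failure described above.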
Short answer: No.
The size of a wchar_t (the basic character unit) is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.
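If code depends on the width of wchar_t, it may be worth checking it rather than assuming it. The sketch below uses a hypothetical helper `wchar_bits` and also shows `char16_t` (available since C++11) as a type whose minimum width is the same on every platform.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the current platform (implementation-defined).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// char16_t is at least 16 bits wide on every conforming implementation,
// so it can always hold one UTF-16 code unit, unlike a bare assumption
// about wchar_t.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16,
              "char16_t must be able to hold a UTF-16 code unit");
```

On a typical Windows toolchain `wchar_bits()` would be expected to give 16, and on typical Linux/Mac toolchains 32, in line with the sizes mentioned above.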
In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it’s also valid for a wchar_t to contain big-endian data.
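Byte order only becomes visible when the 16-bit units are written out as bytes (to a file or a socket, say). Here is a small sketch with a made-up helper, `to_bytes`, that serializes UTF-16 code units in an explicitly chosen byte order instead of relying on the machine's native one.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serializes UTF-16 code units to bytes in an explicit byte order.
inline std::vector<std::uint8_t> to_bytes(const std::u16string& s,
                                          bool little_endian) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t u : s) {
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        const std::uint8_t hi = static_cast<std::uint8_t>((u >> 8) & 0xFF);
        if (little_endian) {
            out.push_back(lo);  // UTF-16LE: low byte first
            out.push_back(hi);
        } else {
            out.push_back(hi);  // UTF-16BE: high byte first
            out.push_back(lo);
        }
    }
    return out;
}
```

The euro sign U+20AC comes out as the bytes AC 20 in little-endian order and 20 AC in big-endian order.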
Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.
A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as the “decomposed” code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.
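The two representations of é can be written down directly to see why naive comparison fails. The function names below are invented for this sketch; the C++ standard library does not provide Unicode normalization, so the code only demonstrates the mismatch and does not resolve it.

```cpp
#include <string>

// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
inline std::u16string e_acute_composed() { return u"\u00E9"; }

// Decomposed form: U+0065 LATIN SMALL LETTER E followed by
// U+0301 COMBINING ACUTE ACCENT.
inline std::u16string e_acute_decomposed() { return u"e\u0301"; }
```

The first string has length 1 and the second has length 2, and `operator==` reports them as different even though they render as the same character. Comparing such strings meaningfully requires normalizing both to the same form first, using a Unicode library or a platform API.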
Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.
On Linux/Mac UTF-8 (with chars) is more common and APIs usually take UTF-8. wchar_t is seen to be wasteful because it uses 4 bytes per character.
For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on-the-fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.
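To show what such a conversion does, here is a hand-written, portable stand-in, not the Windows API itself. On Windows you would normally call MultiByteToWideChar with the CP_UTF8 code page instead. The name `utf8_to_utf16` is hypothetical, and the sketch assumes well-formed UTF-8 input; it performs no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16.
// Assumption: `in` is well-formed UTF-8. Malformed input is NOT detected.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        // The lead byte tells us how long the sequence is.
        if (b0 < 0x80)      { cp = b0;        len = 1; }
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }
        else                { cp = b0 & 0x07; len = 4; }
        // Fold in the continuation bytes (6 payload bits each).
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Keeping UTF-8 in `std::string` internally and converting only at the API boundary means the rest of the code base never has to care how wide wchar_t is.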
The question has been updated to ask what Windows APIs mean when they ask for the “number of characters” in a string.
If the API says “size of the string in characters” they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.
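In practice this means passing the element count of the buffer, not its size in bytes. A minimal sketch follows; `capacity_in_chars` is a made-up helper, and the example does not call any actual Windows API.

```cpp
#include <cstddef>

// Returns the capacity of a fixed-size array in elements ("characters" in
// the sense used by such APIs), not in bytes.
template <typename CharT, std::size_t N>
constexpr std::size_t capacity_in_chars(const CharT (&)[N]) {
    return N;
}
```

For a buffer declared as `wchar_t buf[260]`, `capacity_in_chars(buf)` is 260, whereas `sizeof(buf)` is 260 times `sizeof(wchar_t)`. Passing the byte size where a character count is expected overstates the available room and invites a buffer overrun.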