I've seen that >2-byte Unicode code points like U+10000 can be written as a pair, like \uD800\uDC00. They seem to start with the nibble d, but th…
UTF-8 means (in my own words) that the minimum atom of processing is a byte: the code unit is 1 byte long. I don't know whether this is the historical order, but at least conceptually the UCS-2 and UCS-4 Unicode encodings come first, and UTF-8/UTF-16 appear to solve some problems of UCS-*.
UCS-2 means that each character uses 2 bytes instead of one. It's a fixed-length encoding: UCS-2 stores the bit string of each code point directly, as you say. The problem is that there are characters whose code points require more than 2 bytes to store. So UCS-2 can only handle a subset of Unicode (the range U+0000 to U+FFFF, of course).
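A minimal sketch of that fixed 2-byte idea in Python (an assumption here: Python has no "ucs-2" codec, so utf-16-be is used as a stand-in, which produces the same bytes as UCS-2 for code points up to U+FFFF):

```python
# Every character in the U+0000..U+FFFF range occupies exactly two bytes in a
# UCS-2-style encoding. "utf-16-be" is used as a stand-in for UCS-2 here.
for ch in "A", "é", "€":                       # U+0041, U+00E9, U+20AC
    print(ch, ch.encode("utf-16-be").hex())    # always 2 bytes: 0041, 00e9, 20ac

# A code point above U+FFFF simply does not fit in a single 2-byte unit:
print("\U00010000".encode("utf-16-be").hex())  # 4 bytes: d800dc00 (see UTF-16 below)
```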
UCS-4 uses 4 bytes for each character instead, which is obviously enough to store the bit string of any Unicode code point (the Unicode range goes from U+000000 to U+10FFFF).
The problem with UCS-4 is that characters outside the 2-byte range are very, very uncommon, so any text encoded with UCS-4 wastes a lot of space. Using UCS-2 is therefore a better approach, unless you need characters outside the 2-byte range.
But again, English texts, source code files and so on consist mostly of ASCII characters, and UCS-2 has the same problem: it wastes too much space for texts that are mostly ASCII (too many useless zero bytes).
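A rough illustration of that waste, using Python's codecs (utf-16-be and utf-32-be are used as stand-ins for the fixed-width 2-byte and 4-byte layouts):

```python
# Byte counts for an ASCII-only string: the fixed-width encodings pad every
# character with zero bytes, while UTF-8 stays at one byte per ASCII character.
text = "hello world"
print(len(text.encode("utf-8")))      # 11 bytes
print(len(text.encode("utf-16-be")))  # 22 bytes (UCS-2-like: half are zeros)
print(len(text.encode("utf-32-be")))  # 44 bytes (UCS-4-like: 3/4 are zeros)
```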
That is the problem UTF-8 solves. Characters inside the ASCII range are stored in UTF-8 texts as-is, using just the bit string of the code point/ASCII value of each character. So if a UTF-8 encoded text uses only ASCII characters, it is byte-for-byte identical to the same text encoded as ASCII or Latin-1. Clients without UTF-8 support can handle UTF-8 texts that use only ASCII characters, because they look identical. It's a backward-compatible encoding.
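A quick check of that backward compatibility, sketched in Python (any ASCII-only string behaves the same way):

```python
# ASCII-only text yields byte-for-byte identical output under ASCII, Latin-1
# and UTF-8, which is exactly what makes UTF-8 backward compatible.
text = "plain ASCII text"
assert text.encode("ascii") == text.encode("latin-1") == text.encode("utf-8")
```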
From then on (Unicode characters outside the ASCII range), UTF-8 texts use two, three or four bytes to store each code point, depending on the character.
I don't know the exact method, but the bit string is split across two, three or four bytes using known bit prefixes that indicate how many bytes are used to store the code point, as sketched below. If a byte begins with 0, the character is ASCII and uses only 1 byte (the ASCII range is 7 bits long). If it begins with 1, the character is encoded using two, three or four bytes, depending on the bits that follow.
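The prefixes can be made visible by printing the encoded bytes in binary. A small Python sketch (the sample characters U+0041, U+03B1, U+20AC and U+10000 are just arbitrary picks for the 1-, 2-, 3- and 4-byte cases):

```python
# Leading-bit patterns: 0xxxxxxx (1 byte), 110xxxxx (2 bytes), 1110xxxx
# (3 bytes), 11110xxx (4 bytes); every continuation byte is 10xxxxxx.
for ch in "A", "α", "€", "\U00010000":
    bits = " ".join(f"{b:08b}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} -> {bits}")

# U+0041 -> 01000001
# U+03B1 -> 11001110 10110001
# U+20AC -> 11100010 10000010 10101100
# U+10000 -> 11110000 10010000 10000000 10000000
```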
The problem with UTF-8 is that it requires too much processing (it must examine the first bits of each character to know its length), especially if the text is not English-like. For example, a text written in Greek will use mostly two-byte characters.
UTF-16 uses two-byte code units to solve that problem for non-ASCII texts. That means the atom of processing is a 16-bit word. If a character's code point doesn't fit in one two-byte code unit, it uses 2 code units (four bytes) to encode the character. That pair of code units is called a surrogate pair. I think a UTF-16 text using only characters inside the 2-byte range is byte-for-byte equivalent to the same text in UCS-2.
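That behaviour can be observed directly; a sketch in Python, unpacking the encoded bytes into 16-bit code units (utf-16-be is chosen so no byte-order mark gets in the way):

```python
import struct

def utf16_units(s):
    """Return the 16-bit UTF-16 code units of s as hex strings."""
    return [f"{u:04X}" for (u,) in struct.iter_unpack(">H", s.encode("utf-16-be"))]

print(utf16_units("€"))           # ['20AC']          one code unit (BMP character)
print(utf16_units("\U00010000"))  # ['D800', 'DC00']  two code units: a surrogate pair
```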
UTF-32, in turn, uses 4-byte code units, as UCS-4 does. I don't know the differences between them though.
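For what it's worth, a small Python sketch of the UTF-32 side: each character becomes one 4-byte unit whose stored value is simply the code point itself.

```python
# One 4-byte code unit per character, BMP or supplementary alike; the stored
# value is just the code point.
for ch in "A", "€", "\U00010000":
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-32-be').hex()}")
# U+0041 -> 00000041
# U+20AC -> 000020ac
# U+10000 -> 00010000
```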
The complete picture, which should clear up the confusion, is laid out below:
Referencing what I learned from the comments...
- U+10000 is a Unicode code point (a hexadecimal integer mapped to a character).
- Unicode is a one-to-one mapping of code points to characters.
- The inclusive range of code points from 0xD800 to 0xDFFF is reserved for UTF-16[1] (Unicode vs UTF) surrogate units (see below).
- \uD800\uDC00[2] are two such surrogate units, called a surrogate pair. (A surrogate unit is a code unit that's part of a surrogate pair.)
Abstract representation: Code point (abstract character) --> Code unit (abstract UTF-16) --> Code unit (UTF-16 encoded bytes) --> Interpreted UTF-16

Actual usage example: Input data is bytes and may be wrapped in a second encoding, like ASCII for HTML entities and Unicode escapes, or anything the parser handles --> Encoding interpreted; mapped to code point via scheme --> Font glyph --> Character on screen
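As a concrete version of that pipeline, a small Python sketch using the json module as the parser (an assumption for illustration: any parser that understands \uXXXX escapes would do): the input bytes carry an ASCII-level escape, and the parser combines the surrogate pair into the real code point before the font supplies a glyph.

```python
import json

raw = b'"\\ud800\\udc00"'          # bytes on the wire; the escapes are plain ASCII
decoded = json.loads(raw)          # parser interprets escapes, pairs the surrogates
print(decoded, hex(ord(decoded)))  # 𐀀 0x10000  (the font then renders the glyph)
```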
How surrogate pairs work

The 2048 reserved code points are split into two halves of 1024: high surrogates (0xD800 to 0xDBFF) and low surrogates (0xDC00 to 0xDFFF). A code point above U+FFFF is stored as one high surrogate followed by one low surrogate, each half contributing 10 bits of the value.

Surrogate pair advantages: combining one of 1024 high surrogates with one of 1024 low surrogates gives (2048/2)**2 combinations, enough to yield a range of 1048576 code points (U+10000 to U+10FFFF).[1]
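A minimal sketch of that arithmetic in Python (the helper names to_surrogates/from_surrogates are made up for illustration):

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) surrogate values."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000                        # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print([hex(u) for u in to_surrogates(0x10000)])  # ['0xd800', '0xdc00']
print(hex(from_surrogates(0xD800, 0xDC00)))      # 0x10000
print(hex(from_surrogates(0xD83D, 0xDE00)))      # 0x1f600 (the 😀 emoji)
```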
[1] UTF-16 is the only UTF which uses surrogate pairs.
[2] This is formatted as a Unicode escape sequence.