I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32.
UTF-16 allows all of the Basic Multilingual Plane (BMP) to be represented as single code units. Unicode code points beyond U+FFFF are represented by surrogate pairs.
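The surrogate-pair arithmetic is simple enough to show directly (a minimal sketch in Java; the 0xD800/0xDC00 bases and the 0x10000 offset come from the Unicode standard):

    // Split a supplementary code point into a UTF-16 surrogate pair.
    int cp = 0x1D122;                           // MUSICAL SYMBOL F CLEF
    int v = cp - 0x10000;                       // 20 bits remain after removing the BMP offset
    char high = (char) (0xD800 + (v >> 10));    // top 10 bits -> high surrogate
    char low  = (char) (0xDC00 + (v & 0x3FF));  // low 10 bits -> low surrogate
    System.out.printf("\\u%04X \\u%04X%n", (int) high, (int) low);  // prints \uD834 \uDD22

(Java's Character.toChars(cp) does the same job for you.)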
The interesting thing is that Java and Windows (and other systems that use UTF-16) all operate at the code unit level, not the Unicode code point level. So the string consisting of the single character U+1D122 (MUSICAL SYMBOL F CLEF) gets encoded in Java as "\ud834\udd22", and "\ud834\udd22".length() == 2 (not 1). So it's kind of a hack: these systems pretend that characters are not variable length, even though code points outside the BMP are.
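You can see the mismatch directly (a quick sketch; String.codePointCount is the code-point-aware counterpart to length()):

    String clef = "\uD834\uDD22";                               // U+1D122 as a surrogate pair
    System.out.println(clef.length());                          // 2 -- counts code units
    System.out.println(clef.codePointCount(0, clef.length()));  // 1 -- counts code points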
The advantage of UTF-16 over UTF-8 is that one would give up too much if the same hack were used with UTF-8: treating UTF-8 code units as characters covers only ASCII (U+0000 through U+007F), whereas UTF-16 code units cover the entire BMP.
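To make that concrete (a small sketch; any non-ASCII BMP character would do here):

    String e = "\u00E9";  // é, U+00E9: inside the BMP but outside ASCII
    System.out.println(e.getBytes(java.nio.charset.StandardCharsets.UTF_8).length);  // 2 UTF-8 code units
    System.out.println(e.length());                                                  // 1 UTF-16 code unit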