I have done very little with text encoding. Truthfully, I don't really even know what it means exactly.
For example, if I have something like:
D
The .NET string class encodes strings using UTF-16. That means 2 bytes per character for most characters, although it allows special combinations of two 16-bit values, so-called "surrogate pairs", to form a single 4-byte character.
UTF-8, on the other hand, uses a variable number of bytes (one to four) to represent a particular Unicode character: only one byte for regular ASCII characters, but typically 3 bytes for a Chinese character. Both encodings can represent all Unicode characters, so there is always a mapping between them. They are different binary representations (i.e. for storing in memory or on disk) of the same Unicode character set.
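As a rough illustration of the byte counts described above, here is a minimal sketch. It uses Python rather than .NET purely for brevity, and the helper name byte_lengths is made up for this example.

```python
# Compare how many bytes the same characters need in UTF-8 and UTF-16.
samples = ["A", "\u4e2d"]  # an ASCII letter and a Chinese character (U+4E2D)


def byte_lengths(ch):
    """Return (utf8_length, utf16_length) in bytes for a single character."""
    # "utf-16-le" is used so that no byte order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf16_len} byte(s) in UTF-16")

# Both encodings describe the same character, so decoding either one
# gives back the same string.
utf8_bytes = "\u4e2d".encode("utf-8")
utf16_bytes = "\u4e2d".encode("utf-16-le")
same_character = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le")
print(same_character)
```

The bytes differ between the two encodings, but the decoded character is identical, which is the "mapping between them" mentioned above.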
Since not all Unicode characters fit into the 2 bytes originally reserved by UTF-16, the format also allows a combination of two 16-bit UTF-16 values to form a 4-byte character. Such a combination is called a surrogate pair: a pair of 16-bit Unicode encoding values that, together, represent a single character.
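One practical consequence is that the number of 16-bit values in a UTF-16 string can be larger than the number of characters. The sketch below counts both in Python (not .NET); utf16_code_units is a hypothetical helper written for this example. In .NET, String.Length reports the count of 16-bit values, so it would correspond to the second number printed here.

```python
def utf16_code_units(text):
    """Count the 16-bit values needed to store text in UTF-16."""
    # Every UTF-16 value is 2 bytes; "utf-16-le" avoids a byte order mark.
    return len(text.encode("utf-16-le")) // 2


plain = "A"
emoji = "\U0001F600"  # a code point above FFFF, so it needs a surrogate pair

# len() in Python counts code points, not 16-bit values.
print(len(plain), utf16_code_units(plain))  # 1 1
print(len(emoji), utf16_code_units(emoji))  # 1 2
```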
UTF-8 does not have this problem, since the number of bytes per Unicode character is not fixed. A good general overview of UTF-8, UTF-16 and BOMs can be gathered here.
An excellent overview / introduction to Unicode character encoding is "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" by Joel Spolsky.
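UTF (Unicode Transformation Format) is a family of encodings with different unit sizes: UTF-8, UTF-16 and UTF-32. An encoding defines how many bytes a character takes in memory and how those bytes represent it.
Generally we work with Unicode and ASCII.
Unicode itself is a character set, not an encoding. What .NET calls "Unicode" is UTF-16, which uses 2 bytes for most characters.
ASCII is 1 byte per character and covers only 128 characters.
Every ASCII character can be represented in Unicode. However, Unicode text in general cannot be represented in ASCII without some additional escaping scheme.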
URLs use percent-encoding for this: the special character '%' tells you that the following two hex digits are the value of an encoded byte.
%20 for instance is the byte 32, which is a space.
http://www.google.com?q=space%20character
Placing that URL in a browser would percent-decode the string (interpreting the resulting bytes as UTF-8), and q= would be interpreted as "space character". Notice the %20 is now a space.
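The decoding step described above can be tried out with Python's standard urllib.parse module. This is only a sketch of the idea, not what a browser does internally; the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# Decoding: each %XX is turned back into one byte, and the bytes are read as UTF-8.
decoded = unquote("space%20character")
print(decoded)  # space character

# Encoding: characters that are not safe in a URL become %XX sequences.
encoded = quote("space character")
print(encoded)  # space%20character

# A non-ASCII character is first encoded as UTF-8 (3 bytes here),
# then every byte is written as %XX.
chinese = quote("\u4e2d")
print(chinese)  # %E4%B8%AD
```

Note that the Chinese character ends up as three %XX groups, one per UTF-8 byte, while the result itself contains only ASCII characters.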
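UTF-16 uses 2 bytes for the same character, so a naive attempt to write it out might look like this:
http://www.google.com?q=space%0020character
This example would actually fail, since a URI is supposed to use percent-encoded UTF-8 and each '%' introduces exactly one byte (two hex digits), but it demonstrates the point.
In UTF-16 the space character is 0020, i.e. two bytes with values 0 and 32 respectively (in big-endian order).
Mandarin text consists of Unicode characters as well. UTF-16 stores each of them as one or two 16-bit values; to make them representable in ASCII, for example in a URL, their UTF-8 bytes are percent-encoded.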
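The two bytes 0 and 32 for a space can be checked directly, and doing so also shows that their order depends on the byte order of the UTF-16 variant. A minimal Python sketch (variable names are illustrative only):

```python
import codecs

space = " "

# Big-endian UTF-16: high byte first, matching the "0020" notation.
big_endian = space.encode("utf-16-be")
print(list(big_endian))  # [0, 32]

# Little-endian UTF-16: the same two bytes in the opposite order.
little_endian = space.encode("utf-16-le")
print(list(little_endian))  # [32, 0]

# The generic "utf-16" codec prepends a byte order mark (BOM) so that a
# reader can tell which of the two orders was used.
with_bom = space.encode("utf-16")
starts_with_bom = with_bom.startswith(codecs.BOM_UTF16)
print(starts_with_bom)  # True
```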
Here is a wiki article explaining a little more in depth
http://en.wikipedia.org/wiki/UTF-8
First and foremost: do not despair, you are not alone. Awareness of the treatment of character encoding and text representation in general is an unfortunately uncommon thing, but there is no better time to start learning than right now!
In modern systems, including .NET, text strings are represented in memory by some encoding of Unicode code points. These are just numbers. The code point for the character A is 65. The code point for the copyright sign (©) is 169. The code point for the Thai digit six is 3670.
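These numbers are easy to verify. The sketch below uses Python's built-in ord and chr; the dictionary name code_points and its labels are made up for the example.

```python
# Map a description to the character, written as an escape so the source
# stays plain ASCII.
characters = {
    "letter A": "A",
    "copyright sign": "\u00a9",
    "Thai digit six": "\u0e56",
}

# ord() returns the code point of a character as a plain integer.
code_points = {name: ord(ch) for name, ch in characters.items()}

for name, value in code_points.items():
    # Code points are conventionally written in hex as U+XXXX.
    print(f"{name}: {value} (U+{value:04X})")

# chr() goes the other way, from the number back to the character.
round_trip = chr(65)
print(round_trip)  # A
```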
The term "encoding" refers to how these numbers are represented in memory. There are a number of standard encodings that are used so that textual representation can remain consistent as data is transmitted from one system to another.
A simple encoding standard is UCS-2, whereby the code point is stored raw as a 16-bit word. This is limited because it can only represent code points 0000-FFFF, and that range does not cover the full breadth of Unicode code points.
UTF-16 is the encoding used internally by the .NET String class. Most characters fit into a single 16-bit word here, but values larger than FFFF are encoded using surrogate pairs (see the Wiki). Because of this encoding scheme, the code points D800-DFFF are reserved for the surrogates themselves and cannot be used to encode characters.
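To make the surrogate pair scheme concrete, here is a small sketch of the arithmetic, written in Python rather than .NET. The functions to_surrogate_pair and can_encode_utf16 are hypothetical helpers for this example; the result is cross-checked against Python's own UTF-16 codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above FFFF into its two 16-bit UTF-16 values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 10000-10FFFF need a surrogate pair")
    offset = code_point - 0x10000      # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)     # top 10 bits go into D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into DC00-DFFF
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00

# Cross-check against the built-in big-endian UTF-16 codec.
expected = chr(0x1F600).encode("utf-16-be")
matches_codec = expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(matches_codec)  # True


def can_encode_utf16(text):
    """Return True if text can be encoded as well-formed UTF-16."""
    try:
        text.encode("utf-16-le")
        return True
    except UnicodeEncodeError:
        return False


# A lone value from D800-DFFF is not a character, so the codec rejects it.
print(can_encode_utf16("\ud800"))  # False
```

Because the whole range D800-DFFF is set aside for these two halves, a decoder can always tell a surrogate apart from an ordinary single-word character.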
UTF-8 is perhaps the most popular encoding used today, for a number of reasons which are outlined in the Wiki article.