I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.
They are the same thing, aren't they? Can someone clarify?
UTF-8 is one possible encoding scheme for Unicode text.
Unicode is a broad-scoped standard which defines over 140,000 characters and allocates each a numerical code (a code point). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.
There is more than one way that a string of Unicode code points can be encoded into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit integer, with each being 4 bytes wide.
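A minimal sketch of the code point versus UTF-32 distinction, using Python purely as an illustration. The sample string and variable names are my own, and I use the big-endian codec without a byte order mark so the bytes are easy to read:

```python
# Each character maps to a code point (an integer).
text = "a\u00e0\u6c49"                      # 'a', 'à', '汉'
code_points = [ord(ch) for ch in text]      # [97, 224, 27721]

# UTF-32 stores every code point as one 32-bit (4-byte) integer.
utf32 = text.encode("utf-32-be")            # big-endian, no BOM
chunks = [utf32[i:i + 4].hex() for i in range(0, len(utf32), 4)]
print(code_points)
print(chunks)                               # one 4-byte chunk per code point
```

Every character costs 4 bytes here, even plain ASCII.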
UTF-8 is another encoding, and is becoming the de-facto standard, due to a number of advantages over UTF-32 and others. UTF-8 encodes each code point as a sequence of either 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, to be compatible with ASCII. Code points outside this range use either 2, 3, or 4 bytes each, depending on what range they are in.
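To see the variable width in practice, here is a small Python sketch. The four sample characters are my own choice, one from each length range:

```python
# One sample character per UTF-8 length class.
samples = {
    "A": "\u0041",           # ASCII range
    "e-acute": "\u00e9",     # é, needs 2 bytes
    "han": "\u6c49",         # 汉, needs 3 bytes
    "emoji": "\U0001F600",   # outside the 16-bit range, needs 4 bytes
}
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)

# In the ASCII range the UTF-8 bytes are identical to the ASCII bytes.
ascii_identical = "A".encode("utf-8") == "A".encode("ascii")
```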
UTF-8 has been designed with these properties in mind:
- ASCII compatibility: ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also a valid UTF-8 string representing the same characters.
- Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.
- No ASCII bytes in multi-byte sequences: When a code point uses multiple bytes, none of those bytes contains a value in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is also a security feature.
- Easy validation: UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 due to the very specific structure of UTF-8.
- Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
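The properties above can be checked with a short Python sketch. The helper names `is_valid_utf8` and `char_start` are hypothetical, made up for this illustration, and the sample data is my own:

```python
# ASCII compatibility: the same bytes under both encodings.
ascii_compatible = "hello".encode("ascii") == "hello".encode("utf-8")

# Binary sorting: byte order matches code point order.
words = ["z", "\u00e9", "\u6c49", "\U0001F600", "a"]
by_code_point = sorted(words)  # Python compares str by code point
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))

# No byte of a multi-byte sequence falls in the ASCII range (0x00-0x7F).
no_ascii_bytes = all(b >= 0x80 for b in "\u6c49".encode("utf-8"))

def is_valid_utf8(data: bytes) -> bool:
    """Validate by attempting a strict decode."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def char_start(data: bytes, i: int) -> int:
    """Index of the first byte of the character that contains byte i.

    Continuation bytes look like 10xxxxxx, so step back over them.
    """
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u6c49b".encode("utf-8")   # bytes: 61 e6 b1 89 62
print(char_start(data, 3))          # byte 3 is inside the 3-byte character
print(is_valid_utf8(b"caf\xe9"))    # Latin-1 bytes for "café"
```

In valid UTF-8 the loop in `char_start` never steps back more than 3 bytes, which is the bound mentioned above.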
If I may summarise what I gathered from this thread:
Unicode 'translates' characters to ordinal numbers (in decimal form).
à = 224
UTF-8 is an encoding that 'translates' these numbers to binary representations.
224 = 11000011 10100000
Note that 11000011 10100000 is the UTF-8 encoding of the code point 224, not the plain binary form of the number 224, which is 0b11100000.
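If it helps, the same numbers can be reproduced in Python. This is only a quick check, and the variable names are mine:

```python
ch = "\u00e0"                                   # à
code_point = ord(ch)                            # the Unicode code point
plain_binary = format(code_point, "08b")        # the number 224 written in base 2
utf8_bits = " ".join(format(b, "08b") for b in ch.encode("utf-8"))

print(code_point)      # 224
print(plain_binary)    # 11100000
print(utf8_bits)       # 11000011 10100000
```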
There are lots of characters around the world, like "$, &, h, a, t, ?, 张, 1, =, +, ...".
Then along came an organization dedicated to these characters.
They made a standard called "Unicode".
The standard assigns each character a position (a code point); for example, "$" is at position U+0024.
PS: Of course, there is another organization, ISO, maintaining another standard, "ISO 10646", which is nearly the same.
As above, U+0024 is just a position, so we can't store "U+0024" itself in the computer for the character "$".
There must be an encoding method.
That is where encoding methods come in, such as UTF-8, UTF-16, UTF-32, UCS-2, ...
Under UTF-8, the code point U+0024 is encoded as 00100100.
00100100 is the value we store in the computer for "$".
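As a rough illustration in Python, here is the same "$" run through three of the encodings named above. I picked the big-endian variants without a byte order mark so the output is easy to compare; the variable names are my own:

```python
dollar = "$"
position = f"U+{ord(dollar):04X}"               # the code point, not a stored value

# The same code point produces different bytes under different encodings.
encoded = {
    name: dollar.encode(name).hex()
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}
utf8_bits = format(dollar.encode("utf-8")[0], "08b")

print(position)     # U+0024
print(encoded)
print(utf8_bits)    # 00100100
```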
"Unicode" is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.
UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).
When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.
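For the surrogate pair case, a small Python sketch might make it concrete. The sample character U+1F600 and the names below are my own choices, and the arithmetic follows the standard UTF-16 scheme for code points above U+FFFF:

```python
def utf16_units(ch: str) -> list[int]:
    """Split a character's big-endian UTF-16 encoding into 16-bit units."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_units = utf16_units("\u6c49")           # fits in a single 16-bit value
astral_units = utf16_units("\U0001F600")    # needs a surrogate pair

# The same pair computed by hand.
offset = 0x1F600 - 0x10000
high = 0xD800 + (offset >> 10)              # top 10 bits
low = 0xDC00 + (offset & 0x3FF)             # bottom 10 bits

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units], hex(high), hex(low))
```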
Let me use an example to illustrate this topic:
A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001
Nothing magical so far; it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is: '01101100 01001001'. Done!
But wait a minute, is '01101100 01001001' one character or two characters? You know this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of "encoding" to tell the computer to treat it as one.
This is where the rules of 'UTF-8' come in: http://www.fileformat.info/info/unicode/utf8.htm
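Binary format of bytes in sequence:

| 1st Byte | 2nd Byte | 3rd Byte | 4th Byte | Number of Free Bits | Maximum Expressible Unicode Value |
|----------|----------|----------|----------|---------------------|-----------------------------------|
| 0xxxxxxx |          |          |          | 7                   | 007F hex (127)                    |
| 110xxxxx | 10xxxxxx |          |          | (5+6)=11            | 07FF hex (2047)                   |
| 1110xxxx | 10xxxxxx | 10xxxxxx |          | (4+6+6)=16          | FFFF hex (65535)                  |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | (3+6+6+6)=21        | 10FFFF hex (1,114,111)            |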
According to the table above, if we want to store this character using the 'UTF-8' format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3, as it provides enough space:
| Header | Placeholder | Fill in our Binary | Result   |
|--------|-------------|--------------------|----------|
| 1110   | xxxx        | 0110               | 11100110 |
| 10     | xxxxxx      | 110001             | 10110001 |
| 10     | xxxxxx      | 001001             | 10001001 |
Writing out the result in one line:
11100110 10110001 10001001
This is the UTF-8 (binary) value of the Chinese character! (Confirm it yourself: http://www.fileformat.info/info/unicode/char/6c49/index.htm)
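The header-and-fill procedure can also be sketched in Python. The function name `utf8_encode` is hypothetical and this is only a teaching sketch; in real code you would simply call `str.encode("utf-8")`, which the last lines use as a cross-check:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the four rows of the table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:                       # row 1: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # row 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([                       # row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

encoded = utf8_encode(0x6C49)
bits = " ".join(format(b, "08b") for b in encoded)
print(bits)                                         # 11100110 10110001 10001001
print(encoded == "\u6c49".encode("utf-8"))          # agrees with the built-in codec
```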
A Chinese character: 汉
Its Unicode value: U+6C49
Convert 6C49 to binary: 01101100 01001001
Embed 6C49 as UTF-8: 11100110 10110001 10001001
P.S. If you want to learn this topic in python, click here
They are the same thing, aren't they?
No, they aren't.
I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:
UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.
To elaborate:
Unicode is a standard that defines a map from characters to numbers, the so-called code points (as in the example below). For the full mapping, you can have a look here.
! -> U+0021 (21),
" -> U+0022 (22),
# -> U+0023 (23)
UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's a way/algorithm to convert each of those code points to a sequence of bits or convert a sequence of bits to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.
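Both directions of that conversion can be tried out in Python. This is just a small sketch using the three characters from the example plus one multi-byte character of my own choosing:

```python
# Bits/bytes -> code points -> characters.
raw = bytes([0x21, 0x22, 0x23])
text = raw.decode("utf-8")                        # '!"#'
labels = [f"U+{ord(ch):04X}" for ch in text]      # the code points behind the text

# Characters -> code points -> bytes gives back the same data.
round_trip = text.encode("utf-8") == raw

# A multi-byte sequence decodes to a single code point.
han = bytes.fromhex("e6b189").decode("utf-8")
print(labels, round_trip, f"U+{ord(han):04X}")
```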
Joel gives a really nice explanation and an overview of the history here.