What is the difference between UTF-8 and Unicode?

独厮守ぢ 2020-11-22 17:08

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?

15 answers
  • 2020-11-22 17:30

    Unicode only defines code points, that is, numbers that each represent a character. How you store these code points in memory depends on the encoding you are using. UTF-8 is one way of encoding Unicode characters, among several others (UTF-16 and UTF-32, for example).
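To make the split between code point and encoding concrete, here is a minimal Python sketch. The sample character and the variable names are illustrative choices, not part of the answer above. It takes one character, prints its code point, and then shows the different byte sequences three encodings produce for that same number.

```python
# One character, one code point, several possible byte representations.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (illustrative sample)

code_point = ord(text)                  # the abstract number Unicode assigns
utf8_bytes = text.encode("utf-8")       # variable-width, 8-bit code units
utf16_bytes = text.encode("utf-16-be")  # 16-bit code units, big-endian
utf32_bytes = text.encode("utf-32-be")  # 32-bit code units, big-endian

print(f"U+{code_point:04X}")   # U+00E9
print(utf8_bytes.hex(" "))     # c3 a9
print(utf16_bytes.hex(" "))    # 00 e9
print(utf32_bytes.hex(" "))    # 00 00 00 e9
```

The code point stays the same in every case; only the bytes change with the encoding.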

  • 2020-11-22 17:30

    I have checked the links in Gumbo's answer, and I want to quote some parts of them here so that they exist on Stack Overflow as well.

    "...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

    In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

    Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

    A -> 0100 0001

    In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

    "...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."

    "...OK, so say we have a string:

    Hello

    which, in Unicode, corresponds to these five code points:

    U+0048 U+0065 U+006C U+006C U+006F.

    Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

    "...That's where encodings come in.

    The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

    00 48 00 65 00 6C 00 6C 00 6F

    Right? Not so fast! Couldn't it also be:

    48 00 65 00 6C 00 6C 00 6F 00 ? ..."
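The two byte layouts quoted above can be reproduced with a short Python sketch. The variable names are illustrative; the codec names `utf-16-be` and `utf-16-le` are Python's standard names for the big-endian and little-endian forms of the two-byte encoding.

```python
text = "Hello"

# The code points are the same no matter how the string is stored.
code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)

# Two bytes per code point, high byte first (big-endian).
big_endian = text.encode("utf-16-be").hex(" ").upper()

# The same two bytes per code point, low byte first (little-endian).
little_endian = text.encode("utf-16-le").hex(" ").upper()

# For comparison: UTF-8 needs only one byte for each of these characters.
utf8 = text.encode("utf-8").hex(" ").upper()

print(code_points)    # U+0048 U+0065 U+006C U+006C U+006F
print(big_endian)     # 00 48 00 65 00 6C 00 6C 00 6F
print(little_endian)  # 48 00 65 00 6C 00 6C 00 6F 00
print(utf8)           # 48 65 6C 6C 6F
```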

  • 2020-11-22 17:35

    UTF-8 is a method for encoding Unicode characters as sequences of one to four 8-bit bytes.

    Unicode is a standard for representing a great variety of characters from many languages.
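As a rough illustration of "8-bit sequences", the Python sketch below encodes four sample characters (chosen here for illustration) and counts how many bytes UTF-8 uses for each. The number of bytes grows with the size of the code point.

```python
# Sample characters from different code point ranges (illustrative picks).
samples = ["A", "\u0639", "\u20ac", "\U0001F600"]

# Map each code point to the number of bytes UTF-8 uses for it.
lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}

for label, size in lengths.items():
    print(label, size)
# U+0041 1
# U+0639 2
# U+20AC 3
# U+1F600 4
```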
