What is the difference between UTF-8 and Unicode?

前端未结

关注

 15  1015

独厮守ぢ

I have heard conflicting opinions from people - according to the Wikipedia UTF-8 page.

They are the same thing, aren\'t they? Can someone clarify?

相关标签:

15条回答

遇见更好的自我

2020-11-22 17:30

Unicode only define code points, that is, a number which represents a character. How you store these code points in memory depends of the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.

0 讨论(0)
发布评论:

提交评论
- 加载中...
生来不讨喜

2020-11-22 17:30

I have checked the links in Gumbo's answer, and I wanted to paste some part of those things here to exist on Stack Overflow as well.

"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."

"...OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

"...That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ? ..."

0 讨论(0)
发布评论:

提交评论
- 加载中...
天命终不由人

2020-11-22 17:35

UTF-8 is a method for encoding Unicode characters using 8-bit sequences.

Unicode is a standard for representing a great variety of characters from many languages.

0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2 3