What's the difference between an “encoding,” a “character set,” and a “code page”?


Question


I'm really trying to get better with this stuff. I'm pretty functional with internationalization concepts like this, but I need to get a better background on the theory behind it.

I've read Spolsky's article, but I'm still unclear because these three terms get used interchangeably a LOT -- even in that article. I think at least two of them are talking about the same thing.

I suspect a high percentage of developers flub their way through this stuff on a daily basis. I don't want to be one of those developers anymore.


Answer 1:


A ‘character set’ is just what it says: a properly-specified list of distinct characters.

An ‘encoding’ is a mapping between a character set (typically Unicode today) and a (usually byte-based) technical representation of the characters.

UTF-8 is an encoding, but not a character set. It is an encoding of the Unicode character set(*).
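
To make the distinction concrete, here is a quick irb sketch (Ruby, matching the examples a later answer uses; byte values as reported by Ruby 1.9+). One character, one Unicode code point, two different byte sequences depending on the encoding chosen:

>> '€'.encode('UTF-8').bytes.to_a
=> [226, 130, 172]
>> '€'.encode('UTF-16BE').bytes.to_a
=> [32, 172]

The character set assigns € the code point U+20AC; UTF-8 and UTF-16 are just two different ways of writing that code point down as bytes.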

The confusion comes about because most other well-known encodings (eg.: ISO-8859-1) started out as separate character sets. Then when Unicode came along as a superset of most of these character sets, it became possible to think of them as different (but partial) encodings of the same (Unicode) character set, rather than just isolated character sets. Looking at them this way allows you to convert between them through Unicode easily, which would not be possible if they were merely isolated character sets. But it still makes sense to refer to them as character sets, so either term could be used.
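
For example, here is a minimal Ruby sketch of such a conversion (the sample string is just an illustration): bytes labelled as ISO-8859-1 are mapped to Unicode code points and then re-encoded as UTF-8.

>> s = "caf\xE9".force_encoding('ISO-8859-1')  # 0xE9 is é in ISO-8859-1
=> "caf\xE9"
>> s.encode('UTF-8').bytes.to_a
=> [99, 97, 102, 195, 169]                     # é has become the two bytes 0xC3 0xA9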

A ‘code page’ is a term that originated at IBM, where it identified which set of symbols a device would display. The term continued to be used by DOS and then Windows, through to Unicode-aware Windows, where it is just an encoding with a numbered identifier. Whilst a numbered ‘code page’ is an idea not inherently limited to Microsoft, today the term would almost always just mean an encoding that Windows knows about.

When one is talking of code page ‹some number› one is typically talking about a Windows-specific encoding, as distinct from an encoding devised by a standards body. For example code page 28591 would not normally be referred to under that name, but simply ‘ISO-8859-1’. The Windows-specific Western European encoding based on ISO-8859-1 (with a few extra characters replacing some of its control codes) would normally be referred to as ‘code page 1252’.
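
The effect of those extra characters is easy to see in irb (a small sketch; Ruby registers this encoding under the name 'Windows-1252'):

>> '€'.encode('Windows-1252').bytes.to_a
=> [128]                          # € sits at 0x80, which ISO-8859-1 reserves for a control code
>> '€'.encode('ISO-8859-1')
Encoding::UndefinedConversionError: U+20AC from UTF-8 to ISO-8859-1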

[*: All the UTFs are encodings not character sets, but this kind of thing isn't exclusive to Unicode. For example the Japanese standard JIS X 0208 defines a character set and two different byte encodings for it: the somewhat unpleasant high-byte-based encoding (‘Shift-JIS’), and the deeply horrific escape-switching-based encoding (‘JIS’).]
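
Ruby can demonstrate that footnote too (note: Ruby exposes the escape-switching encoding under its registry name ISO-2022-JP rather than plain ‘JIS’):

>> 'あ'.encode('Shift_JIS').bytes.to_a
=> [130, 160]
>> 'あ'.encode('ISO-2022-JP').bytes.to_a
=> [27, 36, 66, 36, 34, 27, 40, 66]    # ESC $ B ... ESC ( B does the escape switching

One character set (JIS X 0208), two quite different byte encodings: the same pattern as Unicode and the UTFs.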




Answer 2:


A character set is just that: a set of characters that can be used.
Each of these characters is mapped to an integer called a code point.
How those code points are represented in memory is the encoding. An encoding is just a method of transforming a code point (e.g. U+0041, the Unicode code point for the character 'A') into raw data (bits and bytes).
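
A minimal irb sketch of that transformation (assuming Ruby 1.9+, to match the examples in the next answer):

>> 'A'.ord.to_s(16)                    # the code point: U+0041
=> "41"
>> 'A'.encode('UTF-8').bytes.to_a      # one encoding of that code point
=> [65]
>> 'A'.encode('UTF-16BE').bytes.to_a   # another encoding of the same code point
=> [0, 65]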




Answer 3:


A character set is a set of characters, i.e. "glyphs", i.e. visual symbols representing units of communication. The letter a is a glyph, and so is € (the euro sign). Character sets usually map each character to an integer (its codepoint), but it's the encoding that dictates the binary representation of the character.

I'm a Ruby programmer, so here are some examples to help you understand the concepts.

This reveals how Unicode maps codepoints to characters, but not how each byte is stored. (Ruby 1.9 defaults to Unicode strings.)

>> 'a'.codepoints.to_a
=> [97]
>> '€'.codepoints.to_a
=> [8364]

The following reveals how the UTF-8 encoding stores each character as bytes (0 through 255 in base 10). (Ruby 1.9's default encoding is UTF-8.) Since 8364 (base 10) is too large to fit in one byte, UTF-8 has a specific strategy for breaking it into multiple bytes. Wikipedia shows the UTF-8 encoding algorithm, if you want to delve into the implementation.

>> 'a'.bytes.to_a
=> [97]
>> '€'.bytes.to_a
=> [226, 130, 172]
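
If you want the arithmetic behind those three bytes: € is code point 8364 (0x20AC), which needs UTF-8's three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx. Slotting the code point's 16 bits into the x positions gives exactly the bytes above (a worked check in irb):

>> 0x20AC.to_s(2).rjust(16, '0')            # the 16 bits of the code point
=> "0010000010101100"
>> [0b1110_0010, 0b10_000010, 0b10_101100]  # 0010 | 000010 | 101100 slotted in
=> [226, 130, 172]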

Here's the same thing in the ISO-8859-15 character set:

>> 'a'.encode('iso-8859-15').codepoints.to_a
=> [97]
>> '€'.encode('iso-8859-15').codepoints.to_a
=> [164]

And the ISO-8859-15 encoding:

>> 'a'.encode('iso-8859-15').bytes.to_a
=> [97]
>> '€'.encode('iso-8859-15').bytes.to_a
=> [164]

Notice that the ISO-8859-15 codepoints match the byte representation.

Here's a blog entry that might be helpful: http://blog.grayproductions.net/articles/what_is_a_character_encoding . Entries 1 through 3 are good if you don't want to get too Ruby-specific.




Answer 4:


I thought Joel's article was pretty much spot on: it is the history behind the evolution of character sets and storage that has brought this confusion about.

FWIW, in my oversimplified view:

  • Character Sets (ASCII, EBCDIC, Unicode) would be the numeric representation of characters, independent of storage considerations
  • Encoding would relate to the efficient storage of characters (ANSI, UTF-7, UTF-8, etc.), whether in a file, across the wire, etc.
  • Code Page would be the 'kluge' needed when the demand for new characters (without wanting to increase storage capacity) meant that certain characters were only knowable in the additional context of a code page

IMHO, Wikipedia currently doesn't help things by defining 'code page' as 'another name for character encoding' and redirecting 'character set' to 'character encoding'.




Answer 5:


The chapter on Unicode in the book Advanced Perl Programming contains the best description of encodings, character sets, and the other entities of Unicode that I've come across. Unfortunately, I don't think it's available for free online.



Source: https://stackoverflow.com/questions/3441490/whats-the-difference-between-an-encoding-a-character-set-and-a-code-page
