Question
I found an interesting article "A tutorial on character code issues" (http://jkorpela.fi/chars.html#code) which explains the terms "character code"/"code point" and "character encoding".
The former is just an integer that is assigned to a character, for example 65 to the character A. The character encoding defines how such a code point is represented as one or more bytes.
For good old ASCII the author says: "The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value."
So 65, which is the code point for A, would be encoded as 0100 0001.
Because I have 127 characters in ASCII, there are 127 code points, and each code point is always encoded as one byte.
If I summarize this I have the following steps to encode characters in ASCII:
- Assign a number (code point) to each character (e.g. A->65)
- Encode the character with a byte which has the same value (e.g. 1000 0001)
So for the letters A and B it would be:
A -> 65 -> 0100 0001
B -> 66 -> 0100 0010
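In Python terms, those two steps could be sketched roughly like this (just an illustration of the mapping above):

    # Rough sketch of the two steps: character -> code point -> one octet with the same value
    for ch in "AB":
        code_point = ord(ch)             # step 1: A -> 65, B -> 66
        encoded = bytes([code_point])    # step 2: one byte whose value equals the code point
        print(ch, code_point, format(code_point, "08b"), encoded)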
My question is:
Why this separation of code points and encoding in ASCII? ASCII has only one encoding, so at least for ASCII it is not clear to me why the intermediate step (mapping to an integer) is needed. A direct encoding like
A -> 0100 0001
B -> 0100 0010
would also be possible, wouldn't it? If there were multiple encodings for an ASCII character the separation would be reasonable, but with only one encoding form it doesn't make sense to me.
Answer 1:
You're right: each concept doesn't necessarily require a discernible implementation for a particular encoding. But when discussing character sets and encodings in general, it's good to have all the concepts distinguished.
Actually, you could consider ASCII to have two encodings, one 7-bit and one 8-bit. The 7-bit form was used along with schemes that put a parity bit in the 8th bit of each byte. Unicode is notable for having many encodings, including UTF-8, UTF-16 and UTF-32.
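As a quick sketch of the 8-bit-with-parity idea (even parity in the high bit is just one possible convention here, purely for illustration):

    # Sketch: put an even-parity bit into the 8th bit of an otherwise 7-bit ASCII code unit
    def with_even_parity(code_point: int) -> int:
        assert 0 <= code_point < 128              # 7-bit ASCII range
        parity = bin(code_point).count("1") % 2   # 1 if the 7 data bits contain an odd number of 1s
        return code_point | (parity << 7)         # set bit 7 so the total number of 1s is even

    print(format(with_even_parity(ord("A")), "08b"))  # 65 -> 01000001 (already even, parity bit 0)
    print(format(with_even_parity(ord("C")), "08b"))  # 67 -> 11000011 (parity bit set)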
There is a missing term: code unit. An encoding maps a code point to a sequence of code units. Code units are integers of a fixed size. As you may know, integers larger than 8 bits have a byte ordering (a.k.a. endianness). This is why UTF-16 and UTF-32 have big-endian and little-endian variants.
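You can see the code units and their byte order directly; a small sketch using Python's codec names for illustration:

    # Sketch: the same character encoded into different code-unit sizes and byte orders
    text = "A"                                    # code point 65 / U+0041
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(f"{enc:10} {text.encode(enc).hex(' ')}")
    # utf-8      41
    # utf-16-be  00 41
    # utf-16-le  41 00
    # utf-32-be  00 00 00 41
    # utf-32-le  41 00 00 00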
Fundamental rule for computerized text: Read with the encoding that the file or stream was written with. Bytes that represent text must be accompanied by knowledge of the encoding, which comes from a declaration, standard, convention, specification, ….
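A minimal sketch of that rule (the file name and choice of encoding are only for illustration):

    # Sketch: the reader must use the same encoding the writer used
    with open("example.txt", "w", encoding="utf-16-le") as f:
        f.write("Hello")
    with open("example.txt", encoding="utf-16-le") as f:   # must match the writer's encoding
        print(f.read())                                    # reading with a different encoding would garble the text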
There are 128 code points in ASCII, not 127. Also, most of the time ASCII is mentioned, it is not actually ASCII that is in use; ask for the specification that says ASCII, or ask for a correction.
Source: https://stackoverflow.com/questions/47116818/ascii-code-point-vs-character-encoding