Python 3 and b'\x92'.decode('latin1')

Question

I'm getting results I didn't expect from decoding b'\x92' with the latin1 codec. See the session below:

Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
>>> b'\xa3'.decode('latin1').encode('ascii', 'namereplace')
b'\\N{POUND SIGN}'
>>> b'\x92'.decode('latin1').encode('ascii', 'namereplace')
b'\\x92'
>>> ord(b'\x92'.decode('latin1'))
146

The result of decoding b'\xa3' was exactly what I expected, but the two results for b'\x92' were not. I was expecting b'\x92'.decode('latin1') to result in U+2018, but it seems to be returning U+0092.

What am I missing?

Answer 1:

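I just want to make clear what each call does here: decode('latin1') turns the byte into a Unicode string, and encode('ascii', 'namereplace') then tries to turn that string back into ASCII bytes.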

'\xa3' has an ordinal value of 163 (0xa3 in hexadecimal). Since that ordinal does not fit in seven bits, it can't be encoded as ASCII. The namereplace error handler replaces each unencodable character with a \N{...} escape containing the character's Unicode name. Code point 163 (U+00A3) is £, whose name is POUND SIGN.

'\x92', on the other hand, has an ordinal value of 146 (0x92). According to this Wikipedia article, the character isn't printable: U+0092 is a control code in the C1 range (U+0080 to U+009F), known by the alias PRIVATE USE TWO. Control characters have no formal Unicode name, which explains why namereplace falls back to the plain escape '\\x92'.

As an aside, if you need the name of the character, it's much better to do it like this:

import unicodedata
# Python 3: print is a function and str literals are already Unicode
print(unicodedata.name('\xa3'))  # POUND SIGN
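
To see why the two characters behave differently, here is a minimal sketch that reports the code point, general category, and name of a character. The helper name describe and the "<no name>" placeholder are made up for illustration; unicodedata.name() accepts a default as its second argument and raises ValueError without one when a character has no name.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for a single character."""
    # The second argument to unicodedata.name() is a fallback value;
    # without it, unnamed characters such as controls raise ValueError.
    return (
        "U+{:04X}".format(ord(ch)),
        unicodedata.category(ch),
        unicodedata.name(ch, "<no name>"),
    )

print(describe("\xa3"))  # ('U+00A3', 'Sc', 'POUND SIGN')
print(describe("\x92"))  # ('U+0092', 'Cc', '<no name>')
```

Category Cc marks a control character, which is consistent with namereplace having no name to substitute.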

Answer 2:

The error I made was to expect that byte 0x92 decodes to "RIGHT SINGLE QUOTATION MARK" in latin-1; it doesn't. The confusion arose because the byte appeared in a file that was declared as latin1-encoded. It now appears that the file was actually encoded in windows-1252. This is apparently a common source of confusion:

http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

If the character is decoded with the correct encoding, then the expected result is obtained.

>>> b'\x92'.decode('windows-1252').encode('ascii', 'namereplace')
b'\\N{RIGHT SINGLE QUOTATION MARK}'
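
As a rough way to check where the two encodings disagree, the sketch below decodes every byte value with both codecs and collects the mismatches. The function name differing_bytes is hypothetical, and it assumes Python's built-in latin1 and cp1252 codecs, where a byte that cp1252 leaves undefined raises UnicodeDecodeError.

```python
def differing_bytes():
    """Map each byte value to (latin1 char, cp1252 char or None) where the codecs disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin1")  # latin1 defines all 256 byte values
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # byte is undefined in Python's cp1252 codec
        if cp1252_char != latin1_char:
            diffs[value] = (latin1_char, cp1252_char)
    return diffs

diffs = differing_bytes()
print(hex(min(diffs)), hex(max(diffs)))  # 0x80 0x9f
print(ascii(diffs[0x92]))                # ('\x92', '\u2019')
```

Every mismatch falls in the range 0x80 to 0x9F, so a file that contains none of those bytes decodes identically under either name.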

Answer 3:

"I was expecting b'\x92'.decode('latin1') to result in U+2018"

latin1 is an alias for ISO-8859-1. In that encoding, byte 0x92 maps to character U+0092, an unprintable control character.

The encoding you might have really meant is windows-1252, the Microsoft Western code page based on it. In that encoding, 0x92 is U+2019 (RIGHT SINGLE QUOTATION MARK), which is close to the U+2018 (LEFT SINGLE QUOTATION MARK) you expected.

(Further befuddlement arises because for historical reasons web browsers are also confused between the two. When a web page is served as charset=iso-8859-1, web browsers actually use windows-1252.)
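
If text has already been decoded as latin1 when windows-1252 was intended, one possible repair is to reverse the decode and try again. This is only a sketch: the helper name redecode_as_cp1252 is invented here, and it assumes every character in the string is at or below U+00FF and that none of the bytes is undefined in cp1252 (otherwise the calls raise UnicodeEncodeError or UnicodeDecodeError).

```python
def redecode_as_cp1252(text):
    """Undo a latin1 decode and decode the same bytes as windows-1252 instead."""
    # latin1 maps code points 0-255 to identical byte values, so encoding
    # with it recovers the original bytes exactly.
    return text.encode("latin1").decode("cp1252")

wrong = b"It\x92s".decode("latin1")   # contains the control character U+0092
fixed = redecode_as_cp1252(wrong)
print(ascii(fixed))                   # 'It\u2019s'
```

When the raw bytes are still available, decoding them directly with 'windows-1252' is simpler than round-tripping.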

Source: https://stackoverflow.com/questions/39968891/python-3-and-b-x92-decodelatin1

Tags

python

python-3.x

unicode

python-unicode