You are calling decode on a unicode string. Python helpfully first encodes the string using the default ASCII codec so that you have actual bytes to decode. You cannot decode Unicode data itself; it is already decoded.
The decode step that follows then fails because the bytes are not valid UTF-32 data. The bytestring 'abcd' is decodable as UTF-8, because ASCII is a subset of UTF-8. Encoding to ASCII then decoding as UTF-8 produces the same information. Decoding as UTF-16 happened to work by chance: you provided 4 bytes with hex values 0x61, 0x62, 0x63 and 0x64 (the ASCII values for the characters abcd), and those bytes can be decoded as UTF-16 little endian to \u6261 and \u6463. But there is no valid decoding for those 4 bytes in the UTF-32 encoding system: read as one 32-bit value they form 0x64636261 in little endian (0x61626364 in big endian), and both are far above the highest Unicode code point, U+10FFFF.
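The chain described above can be spelled out by hand. This is a minimal sketch for Python 3, where the implicit ASCII encode no longer exists and has to be written explicitly. The variable names are illustrative only, and the little-endian byte order is an assumption made to match the values quoted above.

```python
# Python 3 sketch: make Python 2's implicit ASCII encode an explicit step.
raw = 'abcd'.encode('ascii')   # b'abcd', i.e. bytes 0x61 0x62 0x63 0x64

# ASCII is a subset of UTF-8, so this round trip is lossless.
as_utf8 = raw.decode('utf-8')

# Two 16-bit little-endian code units: 0x6261 and 0x6463.
as_utf16 = raw.decode('utf-16-le')

# One 32-bit little-endian value: 0x64636261, which is above U+10FFFF.
try:
    raw.decode('utf-32-le')
    utf32_error = None
except UnicodeDecodeError as exc:
    utf32_error = exc

print(repr(as_utf8), ascii(as_utf16), type(utf32_error).__name__)
```

Only the UTF-32 attempt raises; the UTF-16 result is valid but meaningless text.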
If s had contained data that cannot be encoded to ASCII first, you would have got a UnicodeEncodeError exception instead; note the Encode in that name:
>>> u'åßç'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
This happens because the implicit encoding to a bytestring failed.
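To see why an encode error surfaces from a decode call, you can emulate the two steps yourself. The helper below, decode_like_python2, is a hypothetical name introduced for illustration; it is a Python 3 approximation of the Python 2 behavior described above, not the interpreter's actual implementation.

```python
def decode_like_python2(text, encoding):
    """Hypothetical helper: mimic Python 2's unicode.decode() in Python 3."""
    raw = text.encode('ascii')    # the implicit step; may raise UnicodeEncodeError
    return raw.decode(encoding)   # the step the caller actually asked for


try:
    decode_like_python2('åßç', 'utf8')
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The exception records which codec failed and where in the string.
print(type(failure).__name__, failure.encoding, failure.start, failure.end)
```

The requested UTF-8 decode is never reached; the failure comes from the ASCII encode that precedes it.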
In Python 3, unicode objects have been renamed to str, and the str.decode() method has been removed from the type to prevent this kind of confusion. Only str.encode() remains. The Python 2 str type has been replaced by the bytes type, which only has a bytes.decode() method.
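A quick way to confirm this split is to check which methods each type exposes. This is a small sketch to run under Python 3; the variable names are illustrative.

```python
text = 'abcd'                 # Python 3 str: Unicode text
data = text.encode('utf-8')   # bytes: encoded data

# Each type only offers the conversion that makes sense for it.
str_can_decode = hasattr(text, 'decode')
bytes_can_encode = hasattr(data, 'encode')

round_trip = data.decode('utf-8')

print(str_can_decode, bytes_can_encode, round_trip == text)
```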
Your second example shows that you are using the Python interpreter interactively in a terminal or console. Python received your input from the terminal as UTF-8 bytes and stored those bytes in a bytestring. Had you used a unicode literal, Python would have automatically decoded those bytes using the encoding declared for your terminal; you can introspect sys.stdin.encoding to see what Python detected:
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> s = '≈'
>>> s
'\xe2\x89\x88'
>>> s = u'≈'
>>> s
u'\u2248'
>>> print s
≈
Conversely, when printing, Python uses the sys.stdout.encoding codec to encode Unicode strings to bytes for your terminal, which then decodes those bytes again to display the right glyphs on your screen.
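The same round trip can be written out explicitly. The following is a Python 3 sketch that assumes a UTF-8 terminal for the input side; the fallback to 'utf-8' when no output encoding is reported is my assumption, not something Python guarantees.

```python
import sys

typed = '\u2248'                       # the character ≈
wire_bytes = typed.encode('utf-8')     # what a UTF-8 terminal sends to Python
decoded = wire_bytes.decode('utf-8')   # what a unicode literal ends up holding

# The output side: encode with whatever codec stdout reports.
out_encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'  # assumed fallback
printable = decoded.encode(out_encoding, errors='replace')

print(repr(wire_bytes), ascii(decoded), out_encoding)
```

With a UTF-8 terminal, wire_bytes matches the '\xe2\x89\x88' bytestring shown in the session above.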
If you are not working in the Python interactive interpreter but are instead working with a Python source file, the codec to use is determined by the PEP 263 source code encoding declaration; without one, Python 2 defaults to decoding the source bytes as ASCII.
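You can inspect how such a declaration is picked up using the standard library's tokenize.detect_encoding(). Note that this sketch runs on Python 3, where the default without a declaration is UTF-8 rather than Python 2's ASCII; declared_encoding is a hypothetical helper name and the sample source lines are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Hypothetical helper: report the codec Python 3 would use for this source."""
    encoding, _lines = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# A PEP 263 declaration must appear on the first or second line of the file.
with_cookie = declared_encoding(b"# -*- coding: latin-1 -*-\ns = 'x'\n")
without_cookie = declared_encoding(b"s = 'x'\n")

print(with_cookie, without_cookie)
```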
sys.getfilesystemencoding() has nothing to do with all this; it tells you what Python thinks your file system metadata is encoded with, e.g. the filenames in directories. The value is used when you use unicode paths for filesystem-related calls like os.listdir().
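As an illustration of where that value does matter, here is a short Python 3 sketch. In Python 3 the text type is str rather than unicode, and os.fsencode() / os.fsdecode() apply the filesystem encoding for you; the filename 'example.txt' is a made-up example and no file is created.

```python
import os
import sys

fs_encoding = sys.getfilesystemencoding()

# Text path in, text names out: Python decodes the names with fs_encoding.
text_names = os.listdir('.')

# Bytes path in, bytes names out: no decoding happens at all.
byte_names = os.listdir(os.fsencode('.'))

# fsencode/fsdecode convert between the two forms using fs_encoding.
round_trip = os.fsdecode(os.fsencode('example.txt'))

print(fs_encoding, len(text_names), round_trip)
```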