Python 3 chokes on CP-1252/ANSI reading

Posted by 只谈情不闲聊 on 2019-11-27 16:12:54

Position 0x81 is unassigned in Windows-1252 (aka cp1252). In Latin-1 (aka ISO 8859-1) it is assigned to the U+0081 HIGH OCTET PRESET (HOP) control character. I can reproduce your error in Python 3.1 like this:

>>> b'\x81'.decode('cp1252')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 0: character maps to <undefined>

or with an actual file:

>>> open('test.txt', 'wb').write(b'\x81\n')
2
>>> open('test.txt').read()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 0: unexpected code byte

Now to treat this file as Latin-1 you pass the encoding argument, like codeape suggested:

>>> open('test.txt', encoding='latin-1').read()
'\x81\n'

Beware that there are differences between the Windows-1252 and Latin-1 encodings, e.g. Latin-1 doesn't have “smart quotes”. If the file you're processing is a text file, ask yourself what that \x81 is doing in it.
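To make that difference concrete, here is a small sketch. The sample bytes are made up for illustration; they only show that the same byte values decode to different characters under the two codecs.

```python
# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252,
# but invisible C1 control characters in Latin-1.
raw = b'\x93quoted\x94'  # made-up sample data

as_cp1252 = raw.decode('cp1252')
as_latin1 = raw.decode('latin-1')

print(ascii(as_cp1252))  # '\u201cquoted\u201d'
print(ascii(as_latin1))  # '\x93quoted\x94'
```

Both decodes succeed, so the absence of an exception does not tell you which codec was the right one.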

You can relax the error handling.

For instance:

f = open(filename, encoding="...", errors="replace")

Or:

f = open(filename, encoding="...", errors="ignore")

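See the docs for the errors parameter of open(): "replace" substitutes U+FFFD for each undecodable byte, while "ignore" silently drops it. Either way you lose data, so use them knowingly.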
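A quick way to see what the two handlers do is to decode a bytes object directly; open() applies the same handlers when reading. The sample bytes here are an assumption chosen to contain the 0x81 byte from the question.

```python
raw = b'caf\x81\n'  # made-up sample; 0x81 has no mapping in cp1252

# "replace" puts U+FFFD (the replacement character) where the bad byte was.
replaced = raw.decode('cp1252', errors='replace')

# "ignore" drops the bad byte without leaving a trace.
ignored = raw.decode('cp1252', errors='ignore')

print(ascii(replaced))  # 'caf\ufffd\n'
print(ascii(ignored))   # 'caf\n'
```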

EDIT:

But are you certain that the problem is in reading the file? Could it be that the exception happens when something is written to the console? Check http://wiki.python.org/moin/PrintFails
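If the exception does turn out to come from printing, one possible workaround is to escape whatever the console encoding cannot represent before writing it. This is only a sketch: safe_for_console is a hypothetical helper name, and the 'ascii' fallback is an assumption for the case where sys.stdout.encoding is not set.

```python
import sys

def safe_for_console(text, encoding=None):
    """Escape characters that the target encoding cannot represent.

    Hypothetical helper: defaults to the console encoding, falling back
    to 'ascii' when sys.stdout.encoding is unavailable.
    """
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # backslashreplace turns unencodable characters into \xNN / \uNNNN escapes.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)

# Force an ASCII-only console to show the effect.
demo = safe_for_console('caf\xe9 \u201cok\u201d', encoding='ascii')
print(demo)
```

With encoding='ascii' the accented letter and the curly quotes come out as backslash escapes instead of raising UnicodeEncodeError.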

All files are "not Unicode". Unicode is an internal representation which must be encoded. You need to determine for each file what encoding has been used, and specify that where necessary when the file is opened.
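When you really don't know the encoding, a rough sketch like the following can narrow it down. decode_with_candidates and its candidate list are assumptions for illustration, not a standard API. Note that latin-1 accepts every byte, so it goes last and a "success" there proves nothing by itself.

```python
def decode_with_candidates(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return (encoding_name, text) for the first candidate that decodes cleanly.

    Hypothetical helper: a clean decode is only a hint, not proof,
    that the encoding is the right one.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # try the next, more permissive candidate
    raise ValueError('none of the candidate encodings fit')

# The byte from the question: invalid in UTF-8, undefined in cp1252.
used, text = decode_with_candidates(b'\x81\n')
print(used, ascii(text))  # latin-1 '\x81\n'
```

You still have to look at the decoded text and judge whether it makes sense for the language and origin of the file.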

As the traceback and error message indicate, the file in question is NOT encoded in cp1252.

If it is encoded in latin1, the "\x81" that it is complaining about is a C1 control character that doesn't even have a name (in Unicode). Consider latin1 extremely unlikely to be valid.

You say "some of the files are parsed with xml.dom.minidom" -- parsed successfully or unsuccessfully?

A valid XML file should declare its encoding (default is UTF-8) in the first line, and you should not need to specify an encoding in your code. Show us the code that you are using to do the xml.dom.minidom parsing.
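For comparison, here is a minimal sketch of minidom honouring the declared encoding when it is handed raw bytes. The XML snippet is made up; the point is that no encoding argument appears in the Python code.

```python
from xml.dom import minidom

# Made-up document: the XML declaration states the encoding,
# and 0xE9 is "e with acute accent" in ISO 8859-1.
xml_bytes = b'<?xml version="1.0" encoding="iso-8859-1"?><name>caf\xe9</name>'

# Pass bytes, not str, so the parser reads the declaration itself.
doc = minidom.parseString(xml_bytes)
value = doc.documentElement.firstChild.data

print(ascii(value))  # 'caf\xe9'
```

If you instead open the file in text mode with the wrong encoding and feed the resulting string to the parser, you bypass this mechanism.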

"others read directly as iterables" -- sample code please.

Suggestion: try opening some of each type of file in your browser. Then click View and click Character Encoding (Firefox) or Encoding (Internet Explorer). What encoding has the browser guessed [usually reliably]?

Other possible encoding clues: What languages are used in the text in the files? Where did you get the files from?

Note: please edit your question with clarifying information; don't answer in the comments.
