Question
I'm trying to iterate through an XML file (UTF-8 encoded, starts with its XML declaration) with lxml, but I get the following error on the character 丂:
UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence
Other characters before this are printed out correctly. The code is:
from lxml import etree

parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    # Print the text of the first child of each top-level element
    print elem[0].text
Does the error mean that the file was parsed as Shift JIS rather than UTF-8?
Answer 1:
The stack trace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately you didn't include it, but it is most likely the last line, where the unicode text is printed to stdout. Note that this is an encode error: it is raised when text is written out, not when the file is parsed (decoded). I assume that stdout uses the cp932 encoding on your system.
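To make the distinction concrete, here is a minimal sketch that separates the two steps. It assumes a hypothetical one-element sample document (the real file is not shown in the question) and uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree`, so it runs without lxml installed. The helper `can_encode` is an illustrative name, not part of either library.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)

# Hypothetical sample document; the asker's real file is not shown.
XML_BYTES = (
    u'<?xml version="1.0" encoding="UTF-8"?>'
    u'<root><item><name>\u4e02</name></item></root>'
).encode("utf-8")


def can_encode(text, codec):
    """Return True if every character in `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


root = ET.fromstring(XML_BYTES)
first_text = root[0][0].text  # same access pattern as elem[0].text in the loop

parsed_ok = (first_text == u"\u4e02")        # parsing decoded the UTF-8 correctly
fits_utf8 = can_encode(first_text, "utf-8")  # UTF-8 can represent the character
fits_cp932 = can_encode(first_text, "cp932") # cp932 cannot, hence the error on print
```

If the parse step yields the expected character and only the cp932 encode step fails, the file was read correctly and the problem lies in the output encoding.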
If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters, such as UTF-8 (see, for example, "Writing unicode strings via sys.stdout in Python").
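One way to do this from inside the script, rather than by changing the environment, is to wrap the output byte stream in a UTF-8 writer. The sketch below is only an illustration: `make_utf8_writer` is a hypothetical helper name, and the demonstration writes to an in-memory stream instead of the real stdout so the result can be inspected.

```python
import codecs
import io


def make_utf8_writer(binary_stream):
    """Wrap a byte stream so that unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


# Demonstration against an in-memory byte stream instead of the real stdout.
demo_stream = io.BytesIO()
writer = make_utf8_writer(demo_stream)
writer.write(u"\u4e02")
written = demo_stream.getvalue()  # the UTF-8 bytes that would reach the terminal

# For real output, wrap the underlying byte stream (assumed usage):
#   Python 2: out = make_utf8_writer(sys.stdout)
#   Python 3: out = make_utf8_writer(sys.stdout.buffer)
```

Setting the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python is an alternative that needs no code change. Either way, the terminal itself still has to interpret the bytes as UTF-8 for the character to display correctly.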
Answer 2:
I had a similar situation using lxml's objectify. Here's how I was able to fix it.
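import unicodedata

my_name = root.name.text
if isinstance(my_name, unicode):
    # Encode to an ASCII byte string: NFKD splits accented characters into
    # base letter + combining mark, and 'ignore' drops anything non-ASCII.
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii', 'ignore')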
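Be aware that this approach is lossy: characters with no ASCII decomposition, including 丂 from the question, disappear entirely. The sketch below shows that behavior next to a non-lossy alternative that escapes unencodable characters. Both function names are hypothetical, and cp932 as the target encoding is an assumption taken from the error message in the question.

```python
import unicodedata


def to_ascii(text):
    """Answer 2's approach: decompose, then drop every non-ASCII character."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


def to_console_bytes(text, encoding="cp932"):
    """Alternative: keep what the codec supports, escape the rest as \\uXXXX."""
    return text.encode(encoding, "backslashreplace")


accented = to_ascii(u"caf\u00e9")        # accent is stripped, base letter kept
dropped = to_ascii(u"\u4e02")            # no ASCII form, so nothing is left
escaped = to_console_bytes(u"\u4e02")    # unencodable character becomes an escape
kept = to_console_bytes(u"\u3042")       # a character cp932 supports is encoded normally
```

The escaping variant never raises UnicodeEncodeError and keeps enough information to identify the original character, which can be preferable for debugging output.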
Source: https://stackoverflow.com/questions/13765614/lxml-encoding-error-when-parsing-utf8-xml