lxml encoding error when parsing utf8 xml

Question

I'm trying to iterate through an XML file (UTF-8 encoded) with lxml, but I get the following error on the character 丂:

UnicodeEncodeError: 'cp932' codec can't encode character u'\u4e02' in position 0: illegal multibyte sequence

Other characters before this are printed out correctly. The code is:

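from lxml import etree

# Tell the parser explicitly that the input is UTF-8.
parser = etree.XMLParser(encoding='utf-8')
tree = etree.parse("filename.xml", parser)
root = tree.getroot()
for elem in root:
    # Python 2 print statement; the text is encoded for stdout here.
    print elem[0].text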

Does the error mean that the file was parsed as Shift JIS instead of UTF-8?

Answer 1:

The stack trace of the UnicodeEncodeError points to the location where the exception occurs. Unfortunately, you didn't include it, but it's most likely the last line, where the Unicode text is printed to stdout. I assume that stdout uses the cp932 encoding on your system. In that case the file was parsed correctly, and the error comes from encoding the text for output, not from decoding the XML.

If my assumptions are correct, you should consider changing your environment so that stdout uses an encoding that can represent all Unicode characters (such as UTF-8). See, for example, the Stack Overflow question "Writing unicode strings via sys.stdout in Python".
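A minimal sketch of that diagnosis and workaround, written so that it runs under Python 3 as well as Python 2. The helper names `can_encode` and `utf8_writer` are hypothetical and not part of lxml or the standard library; an in-memory buffer stands in for stdout so the result does not depend on the console.

```python
# -*- coding: utf-8 -*-
import codecs
import io


def can_encode(text, encoding):
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def utf8_writer(byte_stream):
    """Wrap a byte stream so that Unicode text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


text = u"\u4e02"  # the character from the question

# cp932 has no mapping for this character, UTF-8 does.
cp932_ok = can_encode(text, "cp932")
utf8_ok = can_encode(text, "utf-8")

# In-memory stand-in for stdout (assumption: a real script would wrap stdout instead).
buf = io.BytesIO()
utf8_writer(buf).write(text)
written = buf.getvalue()
```

To apply this to the real output stream, one would pass `sys.stdout` (Python 2) or `sys.stdout.buffer` (Python 3) to `utf8_writer`, or set the `PYTHONIOENCODING` environment variable to `utf-8` before starting Python. The console itself still has to be able to display UTF-8 for the character to show up correctly.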

Answer 2:

I had a similar situation using lxml's objectify. Here's how I was able to fix it.

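import unicodedata

# root is the lxml objectify root element.
my_name = root.name.text
if isinstance(my_name, unicode):
    # Encode to an ASCII byte string; characters with no ASCII equivalent are dropped.
    my_name = unicodedata.normalize('NFKD', my_name).encode('ascii','ignore')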
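It may help to see what this normalization does to different inputs. The sketch below wraps the same two calls in a hypothetical helper, `to_ascii`, and applies it to an accented Latin string and to the character from the question. It avoids the error but loses data: 丂 has no ASCII decomposition, so it disappears entirely.

```python
# -*- coding: utf-8 -*-
import unicodedata


def to_ascii(text):
    """Decompose `text` with NFKD, then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore")


# The accent is split off by NFKD and then discarded; the base letter survives.
accented = to_ascii(u"caf\u00e9")

# No decomposition exists for this character, so nothing survives.
cjk = to_ascii(u"\u4e02")
```

This approach is therefore only reasonable when losing non-ASCII characters is acceptable; for Japanese text, changing the output encoding keeps the content intact.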

Source: https://stackoverflow.com/questions/13765614/lxml-encoding-error-when-parsing-utf8-xml

Tags

xml

encoding

utf-8

lxml
