Python ElementTree won't convert non-breaking spaces when using UTF-8 for output

执笔经年 2021-02-20 14:45

I'm trying to parse, manipulate, and output HTML using Python's ElementTree:

import sys
from cStringIO import StringIO
from xml.etree import ElementTree as E

5 answers
  •  礼貌的吻别
    2021-02-20 15:39

    I think the problem you have here is not with your nbsp entity but with your print statement.

    Your error is:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)

    I think this is because you're taking a UTF-8 byte string (from ET.tostring(p, encoding='utf-8')) and trying to echo it to an ASCII terminal. Python implicitly decodes that string to unicode and then encodes it again as ASCII. Although a non-breaking space can be represented directly in UTF-8, it cannot be represented in ASCII, hence the error.
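
    To make the encoding point concrete, here is a minimal sketch. It assumes Python 3, where encoding and decoding are explicit, rather than the Python 2 used in the question; the variable names are illustrative only.

```python
# Python 3 sketch: conversions are explicit here, unlike Python 2's implicit ones.
nbsp = "\u00a0"  # NO-BREAK SPACE, the character behind &nbsp;

# UTF-8 can represent the character; it takes two bytes, 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# ASCII cannot represent it, so encoding fails.
try:
    nbsp.encode("ascii")
    ascii_encode_failed = False
except UnicodeEncodeError:
    ascii_encode_failed = True

# Decoding the UTF-8 bytes with the ASCII codec fails too,
# which is the kind of error reported in the question.
try:
    utf8_bytes.decode("ascii")
    ascii_decode_error = None
except UnicodeDecodeError as exc:
    ascii_decode_error = exc

print(utf8_bytes)
print(ascii_encode_failed)
print(ascii_decode_error)
```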

    Try saving the output to a file instead and seeing if you get what you expect.
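
    A rough sketch of the file-based check, assuming Python 3 and a hypothetical element p built by hand (the question's own document is not shown in full). Writing in binary mode keeps the terminal's encoding out of the picture.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the element from the question.
p = ET.Element("p")
p.text = "non\u00a0breaking"

# Write the serialized bytes straight to a file in binary mode,
# so no implicit conversion for the terminal takes place.
out_path = os.path.join(tempfile.mkdtemp(), "out.html")
with open(out_path, "wb") as f:
    f.write(ET.tostring(p, encoding="utf-8"))

# Read the bytes back to inspect what was actually written.
with open(out_path, "rb") as f:
    written = f.read()

print(written)
```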

    Alternatively, try print ET.tostring(p, encoding='ascii'), which should cause ElementTree to use numeric character references (such as &#160;) for anything that can't be represented in ASCII.
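
    The difference between the two encodings can be sketched as follows. This assumes Python 3 and a hypothetical element p; it is an illustration, not the code from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical paragraph containing a non-breaking space (U+00A0).
p = ET.Element("p")
p.text = "Hello\u00a0world"

# UTF-8 output carries the character as raw bytes (0xC2 0xA0).
utf8_bytes = ET.tostring(p, encoding="utf-8")

# ASCII output replaces it with a numeric character reference.
ascii_bytes = ET.tostring(p, encoding="ascii")

print(utf8_bytes)
print(ascii_bytes)
```

    The ASCII result contains only 7-bit bytes, so it is safe to print on an ASCII terminal. Depending on the Python version, it may also begin with an XML declaration.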
