I'm trying to parse, manipulate, and output HTML using Python's ElementTree:
import sys
from cStringIO import StringIO
from xml.etree import ElementTree as E
I think the problem you have here is not with your nbsp entity but with your print statement.
Your error is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 19: ordinal not in range(128)
I think this is because you're taking a UTF-8 string (from ET.tostring(p, encoding='utf-8')) and trying to echo it out in an ASCII terminal. So Python is implicitly converting that string to unicode and then converting it again to ASCII. Although nbsp can be represented directly in UTF-8, it cannot be represented directly in ASCII. Hence the error.
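As a rough illustration of that failure mode (a sketch only, not your actual code; the sample text is made up), decoding UTF-8 bytes with the ascii codec raises the same exception type:

```python
# Hypothetical sample: a non-breaking space (U+00A0) encoded as UTF-8.
utf8_bytes = u"a\u00a0b".encode("utf-8")

try:
    # Roughly what an implicit ascii conversion attempts on these bytes.
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the codec the bytes were actually written in succeeds.
round_trip = utf8_bytes.decode("utf-8")
```

The bytes themselves are fine; the failure only appears when something treats them as ASCII.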
Try saving the output to a file instead and seeing if you get what you expect.
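Something along these lines should do for the file test. This is a minimal sketch: the element p and the temporary file are stand-ins I made up, since I can't see the rest of your script.

```python
import os
import tempfile
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the parsed document in the question.
p = ET.fromstring("<p>a&#160;b</p>")

# With encoding='utf-8', tostring() yields encoded bytes,
# so the file is opened in binary mode.
data = ET.tostring(p, encoding="utf-8")
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Read the bytes back to check what was actually written.
with open(path, "rb") as f:
    written = f.read()
os.remove(path)
```

Writing the bytes straight to a file sidesteps the terminal's encoding entirely, which is why it makes a useful check.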
Alternatively, try print ET.tostring(p, encoding='ascii'), which should cause ElementTree to use numeric character references to represent anything that can't be represented in ASCII.
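A small sketch of the ASCII route, again with a made-up element p. I used 'us-ascii' here, which names the same codec; depending on the Python version, 'ascii' may additionally emit an XML declaration at the top of the output.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the parsed document in the question.
p = ET.fromstring("<p>a&#160;b</p>")

# Serialize with an ASCII target encoding; characters outside ASCII
# are written as numeric character references.
ascii_bytes = ET.tostring(p, encoding="us-ascii")

# Every byte is below 128, so an ASCII terminal can display the result.
is_pure_ascii = all(byte < 128 for byte in bytearray(ascii_bytes))
has_char_ref = b"&#160;" in ascii_bytes
```

The non-breaking space survives as a reference that a browser or XML parser turns back into the same character.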