I'm trying to parse, manipulate, and output HTML using Python's ElementTree:
import sys
from cStringIO import StringIO
from xml.etree import ElementTree as E
0xA0 is a Latin-1 byte, not a Unicode character, and the value of p.text in the loop is a str, not unicode. To encode it as UTF-8, Python must first implicitly convert it to a unicode string (i.e. decode it). Since it wasn't told which encoding to use, it assumes ASCII. 0xA0 is not a valid ASCII character, though it is a valid Latin-1 character, so the implicit decode fails.
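To make that implicit step visible, here is a minimal sketch written for Python 3, where the decode has to be spelled out by hand. The variable names are only illustrative; the Python 2 code in the question performs the ASCII decode behind the scenes instead.

```python
# Python 3 sketch: the decode that Python 2 attempts implicitly, made explicit.
raw = b"\xa0"  # the single Latin-1 byte for a non-breaking space

# Decoding as ASCII fails, because 0xA0 is outside the 7-bit ASCII range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding as Latin-1 succeeds and yields the code point U+00A0.
text = raw.decode("latin-1")

# Only a decoded string can be encoded as UTF-8 (two bytes here: 0xC2 0xA0).
utf8 = text.encode("utf-8")

print(ascii_ok, repr(text), utf8)
```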
The reason you have Latin-1 characters instead of Unicode characters is that entitydefs maps names to Latin-1 encoded strings. What you need is the Unicode code point, which you can get from htmlentitydefs.name2codepoint.
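As a rough illustration of the code point route, the sketch below builds a name-to-character table. It assumes Python 3, where the module is html.entities and chr replaces unichr; entity_chars is a made-up name for the example.

```python
# Python 3 sketch; in Python 2 the module is htmlentitydefs and chr is unichr.
from html.entities import name2codepoint

# Map each entity name to the Unicode character for its code point.
entity_chars = {name: chr(codepoint) for name, codepoint in name2codepoint.items()}

# A real Unicode character encodes to UTF-8 without any implicit decode step.
nbsp = entity_chars["nbsp"]
nbsp_utf8 = nbsp.encode("utf-8")

print(repr(nbsp), nbsp_utf8)
```

A table like this is the kind of mapping an entity lookup needs, since every value is a Unicode character rather than an encoded byte string.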
The version below should fix it for you:
import sys
from cStringIO import StringIO
from xml.etree import ElementTree as ET
from htmlentitydefs import name2codepoint
source = StringIO("""<html>
<body>
<p>Less than &lt;</p>
<p>Non-breaking space &nbsp;</p>
</body>
</html>""")
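parser = ET.XMLParser()
# Make expat behave as if an external DTD were declared, so that
# undefined entities are looked up in parser.entity.
parser.parser.UseForeignDTD(True)
# Register every HTML entity as a unicode character, not a Latin-1 byte string.
parser.entity.update((x, unichr(i)) for x, i in name2codepoint.iteritems())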
etree = ET.ElementTree()
tree = etree.parse(source, parser=parser)
for p in tree.findall('.//p'):
    print ET.tostring(p, encoding='UTF-8')