You may find an answer in this related question: "Getting international characters from a web page?"
EDIT: It seems that BeautifulSoup doesn't convert character references written in hexadecimal form (such as &#x01ce;). This can be fixed by rewriting them as decimal references before parsing:
import copy, re
from BeautifulSoup import BeautifulSoup

hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'),
                      lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'
EDIT: The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.
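For illustration, here is a minimal sketch of what such an unescape() function could look like. It is not @dF's actual code; the names _REFERENCE and replace are made up for this example. It assumes the input is a Unicode string, and the try/except imports are only there so the same sketch also runs on Python 3, where htmlentitydefs became html.entities and unichr() became chr().

```python
import re

try:  # Python 2, as used in the answer above
    from htmlentitydefs import name2codepoint
except ImportError:  # Python 3 renamed the module
    from html.entities import name2codepoint

try:
    unichr  # Python 2 built-in
except NameError:
    unichr = chr  # Python 3: chr() already returns a Unicode character

# Hypothetical helper pattern: matches &#xHEX;, &#DECIMAL; and &name;
_REFERENCE = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);')

def unescape(text):
    """Replace character references and named entities with their characters."""
    def replace(match):
        ref = match.group(1)
        try:
            if ref[:2].lower() == '#x':
                # hexadecimal character reference, e.g. &#x01ce;
                return unichr(int(ref[2:], 16))
            if ref.startswith('#'):
                # decimal character reference, e.g. &#462;
                return unichr(int(ref[1:]))
            # named entity, e.g. &eacute;
            return unichr(name2codepoint[ref])
        except (KeyError, ValueError, OverflowError):
            # unknown name or out-of-range code point: leave the text as it was
            return match.group(0)
    return _REFERENCE.sub(replace, text)
```

With this sketch, unescape(u'&#x01ce;&#462;') handles both the hexadecimal and the decimal reference without any markup massage, and unknown references such as &bogus; are passed through unchanged.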