When using BeautifulSoup with html5lib, it adds the html, head and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Yet another solution:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')
# Example of modifying the content before extracting it:
# point the link at Stack Overflow instead of Google
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'
# Join the direct child tags of <body> to get the markup without the html/body wrapper
print(''.join(str(i) for i in soup.body.find_all(recursive=False)))
Since v4.0.1 there's a method decode_contents():
>>> BeautifulSoup('<h1>FOO</h1>', 'html5lib').body.decode_contents()
'<h1>FOO</h1>'
More details in a solution to this question: https://stackoverflow.com/a/18602241/237105
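If you want something that works regardless of which parser you pick, here is a minimal sketch built on decode_contents(). The helper name inner_html and its default parser argument are my own, not part of bs4. It assumes that html5lib and lxml wrap a fragment in html/body while the built-in html.parser does not, so it falls back to the soup itself when there is no body tag.

```python
from bs4 import BeautifulSoup


def inner_html(markup, parser='html.parser'):
    """Return the markup without the html/head/body wrapper a parser may add.

    inner_html is a hypothetical helper name, not a bs4 function.
    """
    soup = BeautifulSoup(markup, parser)
    body = soup.body
    # html.parser does not add a <body>, so use the whole soup in that case
    root = body if body is not None else soup
    return root.decode_contents()


print(inner_html('<h1>FOO</h1>'))
```

Passing 'html5lib' or 'lxml' as the second argument should give the same fragment back, as long as that parser is installed.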