I'm trying to parse a fragment of HTML:
<body><h1>title</h1></body>
I use lxml.html.from
.fragment_fromstring() removes the <html> tag as well. Basically, whenever you do not have a full HTML document (with an <html> top-level element and/or a doctype), .fromstring() falls back to fragment-style parsing, as .fragment_fromstring() does, and that always drops both the <html> and the <body> tags.
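A quick way to see this behaviour is to check which element each parser hands back. This is a minimal sketch, assuming the fragment from the question was a <body> wrapping a single <h1>; the variable names are illustrative only.

```python
import lxml.html

fragment = '<body><h1>title</h1></body>'  # assumed input, modelled on the question

# Both calls hand back the <h1> element rather than the <body> that wrapped it.
via_fromstring = lxml.html.fromstring(fragment)
via_fragment = lxml.html.fragment_fromstring(fragment)

print(via_fromstring.tag, via_fragment.tag)
```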
The workaround is to tell .fragment_fromstring() to give you a parent tag:
>>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')
This does not preserve any attributes on the original <body> tag.
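To make the attribute loss concrete, here is a small sketch. The sample markup with class="foo" is an assumption chosen for illustration.

```python
import lxml.html

source = '<body class="foo"><h1>a</h1></body>'  # assumed sample input

# create_parent='body' builds a brand-new <body> and puts the parsed children in it.
rebuilt = lxml.html.fragment_fromstring(source, create_parent='body')

print(rebuilt.tag)           # body
print(dict(rebuilt.attrib))  # {} because class="foo" was not carried over
print(rebuilt[0].tag)        # h1
```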
Another workaround is to use the .document_fromstring() method, which wraps your document in an <html> tag that you can then strip off again:
>>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]
This does preserve attributes on the <body>:
>>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
{'class': 'foo'}
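Indexing with [0] relies on <body> being the first child of the generated <html> element. A slightly more defensive sketch, assuming the input might also carry head-level elements such as <title>, looks the element up by name instead. The sample markup is hypothetical; find() is the usual ElementTree-style lookup.

```python
import lxml.html

source = '<title>t</title><body class="foo"><h1>a</h1></body>'  # hypothetical input

root = lxml.html.document_fromstring(source)  # root is the generated <html> element
body = root.find('body')                      # look up <body> by name, not by position

print(body.tag)
print(dict(body.attrib))
```

With input like this, the first child of the root is expected to be a generated <head>, which is why the lookup by name is the safer choice.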
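Using the .document_fromstring() function on your first example gives (on Python 3, where tostring() returns bytes by default):
>>> body = lxml.html.document_fromstring('<body><h1>title</h1></body>')[0]
>>> lxml.html.tostring(body)
b'<body><h1>title</h1></body>'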
If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document:
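import lxml.html
# Pick the private full-document check that matches the input type (bytes vs. text).
htmltest = (lxml.html._looks_like_full_html_bytes if isinstance(inputtext, bytes)
            else lxml.html._looks_like_full_html_unicode)
if htmltest(inputtext):
    tree = lxml.html.fromstring(inputtext)  # full document: keep the <html> root
else:
    tree = lxml.html.document_fromstring(inputtext)[0]  # fragment: keep the <body>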
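The underscore-prefixed helpers are private to lxml and may change between releases. As a hedged alternative, the same idea can be sketched with a hand-written regular expression. Note that parse_keep_body and the FULL_DOC_* patterns are hypothetical names introduced here, and the pattern is only modelled on the full-document check described above, not taken from lxml.

```python
import re

import lxml.html

# Hypothetical patterns: treat input as a full document if it starts with <html or a doctype.
FULL_DOC_TEXT = re.compile(r'^\s*<(?:html|!doctype)', re.IGNORECASE)
FULL_DOC_BYTES = re.compile(rb'^\s*<(?:html|!doctype)', re.IGNORECASE)


def parse_keep_body(inputtext):
    """Return the <html> root for full documents, otherwise the <body> element."""
    pattern = FULL_DOC_BYTES if isinstance(inputtext, bytes) else FULL_DOC_TEXT
    root = lxml.html.document_fromstring(inputtext)
    if pattern.match(inputtext):
        return root
    # Fragment: drop the generated <html> wrapper but keep <body> and its attributes.
    return root.find('body')


print(parse_keep_body('<body class="foo"><h1>a</h1></body>').tag)   # body
print(parse_keep_body('<html><body><h1>a</h1></body></html>').tag)  # html
```

Looking up <body> by name rather than by index keeps the helper working when the parser also generates a <head>.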