parse html body fragment in lxml

庸人自扰 2021-01-14 02:36

I'm trying to parse a fragment of HTML:

<body><h1>title</h1></body>

I use lxml.html.from

1 answer
  •  孤街浪徒
    2021-01-14 03:15

    .fragment_fromstring() removes the <html> tag as well; basically, whenever you do not have an HTML document (with an <html> top-level element and/or a doctype), .fromstring() falls back to .fragment_fromstring(), and that method always removes both the <html> and the <body> tags.
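The behaviour described above can be checked with a short sketch. It assumes the question's fragment was a <body> wrapping a single <h1> (the markup is an assumption, as are the variable names); under that assumption both calls should hand back the <h1> element rather than the <body>.

```python
import lxml.html

# Assumed input: a <body> fragment with one heading and no <html> or doctype.
markup = '<body><h1>title</h1></body>'

# Neither call sees a full document, so the <body> wrapper is dropped
# and the single child element is returned instead.
via_fromstring = lxml.html.fromstring(markup)
via_fragment = lxml.html.fragment_fromstring(markup)

print(via_fromstring.tag)
print(via_fragment.tag)
```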

    The work-around is to tell .fragment_fromstring() to give you a parent tag:

    >>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')

    This does not preserve any attributes on the original <body> tag.
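A minimal sketch of that limitation, assuming a <body> that carries a class attribute (the markup is illustrative, not from the question): create_parent builds a fresh, empty element, so the attribute should not survive.

```python
import lxml.html

# Illustrative input: the class attribute sits on the original <body>.
markup = '<body class="foo"><h1>a</h1></body>'

# create_parent='body' wraps the parsed children in a brand-new <body>.
body = lxml.html.fragment_fromstring(markup, create_parent='body')

print(body.tag)           # tag of the newly created parent
print(dict(body.attrib))  # attributes on the new parent
print(body[0].tag)        # the original child is still there
```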

    Another work-around is to use the .document_fromstring() method, which will wrap your document in an <html> tag, which you can then remove again:

    >>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]

    This does preserve attributes on the <body>:

    >>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
    {'class': 'foo'}

    Using the .document_fromstring() function on your first example gives:

    >>> body = lxml.html.document_fromstring('<body><h1>title</h1></body>')[0]
    >>> lxml.html.tostring(body)
    '<body><h1>title</h1></body>'

    If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document first:

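    import lxml.html

    # Pick the matcher for the input type (bytes vs. text), as lxml.html.fromstring() does.
    # Note: these underscore-prefixed helpers are private to lxml and may change.
    htmltest = (lxml.html._looks_like_full_html_bytes if isinstance(inputtext, bytes)
                else lxml.html._looks_like_full_html_unicode)
    if htmltest(inputtext):
        tree = lxml.html.fromstring(inputtext)  # full document: parse as usual
    else:
        tree = lxml.html.document_fromstring(inputtext)[0]  # fragment: keep the <body>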
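If relying on lxml's private helpers is a concern, the same check can be sketched with a local regular expression. The pattern and the function name parse_keeping_body below are hypothetical stand-ins, not part of lxml; the sketch assumes a "full document" is anything that starts with <html or a doctype.

```python
import re
import lxml.html

# Hypothetical stand-in for lxml's private full-document check:
# the input counts as a full document if it starts with <html or a doctype.
_FULL_HTML = re.compile(r'^\s*<(?:html|!doctype)', re.I)

def parse_keeping_body(markup):
    """Parse markup; for fragments, return the <body> element that
    document_fromstring() builds, so its attributes are kept."""
    # latin-1 decoding never fails, which is enough for sniffing the prefix.
    text = markup.decode('latin-1') if isinstance(markup, bytes) else markup
    if _FULL_HTML.match(text):
        return lxml.html.fromstring(markup)
    return lxml.html.document_fromstring(markup)[0]

fragment = parse_keeping_body('<body class="foo"><h1>title</h1></body>')
document = parse_keeping_body('<html><body><h1>title</h1></body></html>')
print(fragment.tag, dict(fragment.attrib))
print(document.tag)
```

With a fragment the function returns the <body> element, and with a full document it returns whatever lxml.html.fromstring() would normally give, so callers that need the body in both cases would still have to look it up on the document root.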
    
