When parsing HTML, why do I need item.text sometimes and item.text_content() other times?

花落未央 2021-01-17 19:23

Still learning lxml. I discovered that sometimes I cannot get to the text of an item from a tree using item.text. If I use item.text_content() I am good to go. I am not s

2 Answers
  • 2021-01-17 19:52

    According to the docs, the text_content method:

    Returns the text content of the element, including the text content of its children, with no markup.

    So for example,

    import lxml.html as lh
    data = """<a><b><c>blah</c></b></a>"""
    doc = lh.fromstring(data)
    print(doc)
    # <Element a at b76eb83c>
    

    doc is the Element a. The a tag has no text immediately following it (between the <a> and the <b>), so doc.text is None:

    print(doc.text)
    # None
    

    But there is text nested inside the c tag, so doc.text_content() is not None:

    print(doc.text_content())
    # blah
    

    PS. There is a clear description of the meaning of the text attribute here. Although it is part of the docs for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.
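To make the text and tail attributes mentioned above concrete, here is a minimal sketch. The HTML fragment is a made-up example, chosen so that every text slot is populated; it is not taken from the question.

```python
import lxml.html as lh

# Hypothetical fragment: text before, inside, and after a child element.
doc = lh.fromstring("<p>Hello <b>bold</b> world</p>")
b = doc[0]  # the first (and only) child of <p>

head_text = doc.text            # text between <p> and <b>
child_text = b.text             # text between <b> and </b>
child_tail = b.tail             # text between </b> and </p>
full_text = doc.text_content()  # all text in the subtree, markup stripped

print(repr(head_text))   # 'Hello '
print(repr(child_text))  # 'bold'
print(repr(child_tail))  # ' world'
print(repr(full_text))   # 'Hello bold world'
```

So .text only covers the text up to the first child, the text after a child lives on that child's .tail, and text_content() joins all of them.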

  • 2021-01-17 20:02

    You may be confusing different and incompatible interfaces that lxml implements: the lxml.etree items have a .text attribute, while (for example) those from lxml.html implement the text_content method. Those from BeautifulSoup, also included in lxml, have a .string attribute, but only sometimes (only nodes with a single child which is a string).

    Yeah, it is inherently confusing that lxml chooses both to implement its own interfaces and emulate or include other libraries, but it can be convenient...;-).
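A quick way to see which interface a given element offers is to parse the same markup with both modules and compare. This is a small sketch with a made-up snippet; note that lxml.html elements subclass the etree element type, so they expose .text as well as text_content(), while plain lxml.etree elements have no text_content() and need itertext() for the same result.

```python
from lxml import etree
import lxml.html as lh

markup = "<a>x<b>y</b>z</a>"  # hypothetical snippet for illustration

etree_el = etree.fromstring(markup)  # plain lxml.etree element
html_el = lh.fromstring(markup)      # lxml.html element

# Only the lxml.html element provides text_content().
etree_has_method = hasattr(etree_el, "text_content")
html_has_method = hasattr(html_el, "text_content")

# Both kinds of element expose .text (text before the first child).
etree_text = etree_el.text
html_text = html_el.text

# For a plain etree element, joining itertext() gives the same full text.
etree_full = "".join(etree_el.itertext())
html_full = html_el.text_content()

print(etree_has_method, html_has_method)  # False True
print(etree_text, html_text)              # x x
print(etree_full, html_full)              # xyz xyz
```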
