Still learning lxml. I discovered that sometimes I cannot get to the text of an item from a tree using item.text. If I use item.text_content() I am good to go. I am not s
According to the docs, the text_content method:
Returns the text content of the element, including the text content of its children, with no markup.
So for example,
import lxml.html as lh
data = """<a><b><c>blah</c></b></a>"""
doc = lh.fromstring(data)
print(doc)
# <Element a at b76eb83c>
doc is the Element a. The a tag has no text immediately following it (between the <a> and the <b>), so doc.text is None:
print(doc.text)
# None
but there is text after the <c> tag, inside a descendant of a, so doc.text_content() is not None:
print(doc.text_content())
# blah
PS. There is a clear description of the meaning of the text attribute here. Although it is part of the docs for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.
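To make the text/tail distinction concrete, here is a small sketch. It is an assumption on my part to use the standard library's xml.etree.ElementTree here (so it runs without lxml installed); lxml.etree follows the same text/tail model. The sample markup and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: text sits both directly inside <a> and after its children.
root = ET.fromstring("<a>head<b>inner</b>after b<c/>after c</a>")
b = root.find("b")
c = root.find("c")

root_text = root.text  # text between <a> and the first child: "head"
b_text = b.text        # text inside <b>: "inner"
b_tail = b.tail        # text after </b>, before the next tag: "after b"
c_text = c.text        # <c/> is empty, so this is None
c_tail = c.tail        # text after <c/>: "after c"

# Joining itertext() collects every text and tail in document order,
# which is roughly what text_content() gives you on lxml.html elements.
full_text = "".join(root.itertext())
print(full_text)
```

So .text only ever covers the stretch up to the first child tag; everything after a child belongs to that child's .tail.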
You may be confusing the different interfaces that lxml implements: the lxml.etree items have a .text attribute, while (for example) those from lxml.html also implement the text_content method (and those from BeautifulSoup, which lxml can use as a parser, have a .string attribute... sometimes [[only nodes with a single child which is a string...]]).
Yeah, it is inherently confusing that lxml chooses both to implement its own interfaces and to emulate or include other libraries, but it can be convenient... ;-).
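If you don't want to care which interface a given element offers, one option is a small wrapper. This is only a sketch: all_text is a hypothetical helper name, not part of lxml, and the demo uses the standard library's ElementTree (which has no text_content) to exercise the fallback path.

```python
import xml.etree.ElementTree as ET


def all_text(element):
    """Return all text under element, whichever interface it offers."""
    # lxml.html elements provide text_content(); prefer it when present.
    text_content = getattr(element, "text_content", None)
    if callable(text_content):
        return text_content()
    # Plain ElementTree-style elements: join every text and tail in order.
    return "".join(element.itertext())


# Standard-library element, so the itertext() fallback is used here.
sample = ET.fromstring("<a><b><c>blah</c></b></a>")
print(all_text(sample))
```

With an lxml.html element the first branch would be taken instead, so the caller never has to check element.text for None.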