I\'m working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
import lxml.etree as ET
body = t.xpath("//body");
for tag in body:
h = html.fromstring( ET.tostring(tag[0]) ).xpath("//h1");
p = html.fromstring( ET.tostring(tag[1]) ).xpath("//p");
htext = h[0].text_content();
ptext = h[0].text_content();
you can also use .get('href')
for a tag and .attrib
for attribute ,
here tag no is hardcoded but you can also do this dynamic
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])
Here is a Python 3 version:
from xml.sax import saxutils
from lxml import html
def inner_html(tree):
""" Return inner HTML of lxml element """
return (saxutils.escape(tree.text) if tree.text else '') + \
''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node:
>>> from lxml import etree
>>> from cStringIO import StringIO
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants(),:
... print etree.tostring(child)
...
<h1>A title</h1>
<p>Some text</p>
This can be shorthanded as follows:
print ''.join([etree.tostring(child) for child in root.iterdescendants()])