Equivalent to InnerHTML when using lxml.html to parse HTML

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

4 Answers

鱼传尺愫

2020-12-01 16:19

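import lxml.etree as ET
from lxml import html

# t is assumed to be a tree parsed beforehand, e.g. t = html.fromstring(page_source)
body = t.xpath("//body")
for tag in body:
    # tag[0] and tag[1] are the first two children of <body>
    h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
    p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
    htext = h[0].text_content()
    ptext = p[0].text_content()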

You can also use .get('href') to read a single attribute of a tag, and .attrib to get all of its attributes.

Here the child indexes are hardcoded, but you can also look the tags up dynamically.
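
As a rough sketch of the dynamic variant, the children can be looked up by tag name rather than by position. The sample markup and the variable names below are made up for illustration; only `find()`, `get()`, `attrib` and `text_content()` come from lxml itself.

```python
from lxml import html

# Hypothetical sample page, used only for illustration.
page = html.fromstring(
    '<html><body><h1 class="title">Title</h1>'
    '<p>Some <a href="/docs" rel="nofollow">link</a></p></body></html>'
)

body = page.xpath("//body")[0]

# Look the children up by tag name instead of by index.
h1 = body.find("h1")
p = body.find("p")
htext = h1.text_content()
ptext = p.text_content()

link = p.find("a")
href = link.get("href")      # a single attribute, None if it is missing
attrs = dict(link.attrib)    # all attributes, copied into a plain dict
```

`find()` only returns the first matching direct child, so `findall()` or an XPath expression may fit better when a tag can occur more than once.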


一生所求

2020-12-01 16:29
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
```
<body>This text is ignored
<h1>Title</h1><p>Some text</p></body>
```
Text directly under the root element is ignored. I ended up doing this:
```
(body.text or '') +\
''.join([html.tostring(child) for child in body.iterchildren()])
```

走了就别回头了

2020-12-01 16:43

Here is a Python 3 version:

from xml.sax import saxutils
from lxml import html

def inner_html(tree):
    """ Return inner HTML of lxml element """
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])

Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!
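
A small check of why the escaping matters, assuming a made-up `<div>` whose leading text contains an escaped `<`. The function is repeated so the snippet runs on its own; the sample markup and the name `result` are illustrative.

```python
from xml.sax import saxutils
from lxml import html

def inner_html(tree):
    """Return inner HTML of an lxml element (same approach as above)."""
    return (saxutils.escape(tree.text) if tree.text else '') + \
        ''.join(html.tostring(child, encoding=str) for child in tree.iterchildren())

# Hypothetical input: the parser decodes &lt; to a literal "<" in div.text.
div = html.fromstring('<div>1 &lt; 2 <b>bold</b> tail</div>')

# The leading text is re-escaped; each child is serialized with its tail text.
result = inner_html(div)
print(result)
```

Without the `saxutils.escape()` call, the decoded `<` would be written back raw and could be read as the start of a tag by whatever consumes the string.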


感情败类

2020-12-01 16:44

You can get the children of an ElementTree node with the iterchildren() method of the root node (getchildren() is deprecated, and iterdescendants() also walks nested elements, so it would serialize them twice):

>>> from io import StringIO
>>> from lxml import etree
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterchildren():
...     print(etree.tostring(child, encoding="unicode"))
...
<h1>A title</h1>

<p>Some text</p>

This can be shortened to:

print(''.join(etree.tostring(child, encoding="unicode") for child in root.iterchildren()))
