Equivalent to InnerHTML when using lxml.html to parse HTML

无人共我 2020-12-01 15:52

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

4 answers
  • 2020-12-01 16:19
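    import lxml.etree as ET
    from lxml import html

    # t is assumed to be an already parsed document, e.g. t = html.parse(...)
    body = t.xpath("//body")
    for tag in body:
        # tag[0] and tag[1] are the first two children of <body>
        h = html.fromstring(ET.tostring(tag[0])).xpath("//h1")
        p = html.fromstring(ET.tostring(tag[1])).xpath("//p")
        htext = h[0].text_content()
        ptext = p[0].text_content()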
    

    You can also use .get('href') on a tag to read a single attribute, and .attrib to access all of its attributes.

    Here the child indexes are hard-coded, but you can also determine them dynamically.
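
As a small illustration of the attribute access mentioned above, here is a minimal sketch. The `<a>` markup and the variable names are made up for the example; `.get()` and `.attrib` are the standard lxml element API.

```python
from lxml import html

# Hypothetical link element, used only for illustration.
link = html.fromstring('<a href="/docs" class="nav">Docs</a>')

# .get() returns one attribute value, or None when the attribute is absent.
href = link.get("href")
title = link.get("title")

# .attrib behaves like a dictionary of all attributes on the element.
attrs = dict(link.attrib)

print(href, title, attrs)
```

`.get()` also accepts a default as a second argument, which avoids a separate `None` check.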

  • 2020-12-01 16:29

    Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

    <body>This text is ignored
    <h1>Title</h1><p>Some text</p></body>
    

    Text directly under the root element is ignored. I ended up doing this:

    (body.text or '') +\
    ''.join([html.tostring(child) for child in body.iterchildren()])
    
  • 2020-12-01 16:43

    Here is a Python 3 version:

    from xml.sax import saxutils
    from lxml import html
    
    def inner_html(tree):
        """ Return inner HTML of lxml element """
        return (saxutils.escape(tree.text) if tree.text else '') + \
            ''.join([html.tostring(child, encoding=str) for child in tree.iterchildren()])
    

    Note that this includes escaping of the initial text as recommended by andreymal -- this is needed to avoid tag injection if you're working with sanitized HTML!

  • 2020-12-01 16:44

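    You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node. Note that iterdescendants() also walks nested elements, so iterchildren() is the safer choice when the children contain tags of their own: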

    >>> from lxml import etree
    >>> from cStringIO import StringIO
    >>> t = etree.parse(StringIO("""<body>
    ... <h1>A title</h1>
    ... <p>Some text</p>
    ... </body>"""))
    >>> root = t.getroot()
    >>> for child in root.iterdescendants():
    ...  print etree.tostring(child)
    ...
    <h1>A title</h1>
    
    <p>Some text</p>
    

    This can be shortened as follows:

    print ''.join([etree.tostring(child) for child in root.iterdescendants()])
    