lxml.etree, element.text doesn't return the entire text from an element

梦毁少年i 2021-02-07 10:39

I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

8 answers
  •  孤街浪徒
    2021-02-07 10:44

    As a public service to people out there who may be as lazy as I am, here's some code from above that you can run.

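    from lxml import etree

    def get_text1(node):
        # node.text plus the tail of each direct child; skips text inside children
        result = node.text or ""
        for child in node:
            if child.tail is not None:
                result += child.tail
        return result

    def get_text2(node):
        # Recursive: own text, then each child's full text, then own tail
        return ((node.text or '') +
                ''.join(map(get_text2, node)) +
                (node.tail or ''))

    def get_text3(node):
        # node.text plus each child serialized as markup (tostring includes the tail)
        return (node.text or "") + "".join(
            [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])


    root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

    print(root.xpath("text()"))
    print(get_text1(root))
    print(get_text2(root))
    print(root.xpath("string()"))
    # encoding="unicode" makes tostring return str instead of bytes on Python 3
    print(etree.tostring(root, method="text", encoding="unicode"))
    print(etree.tostring(root, method="xml", encoding="unicode"))
    print(get_text3(root))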
    

    Output is:

    snowy:rpg$ python test.py 
    [' text1 ', ' text2 ']
     text1  text2 
     text1  link  text2 
     text1  link  text2 
     text1  link  text2 
    <td> text1 <a> link </a> text2 </td>
     text1 <a> link </a> text2 
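
For anyone wondering why element.text stops at text1: as far as I understand lxml's model, an element's .text only holds the text before its first child, and whatever follows a child's closing tag is stored on that child as .tail. Here is a minimal sketch of that, assuming the same sample markup as above (the variable names are just for illustration). It also uses itertext(), which is not among the functions listed above but should give the same result as string() here.

```python
from lxml import etree

# Sample element assumed to mirror the markup from the question.
root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
link = root[0]  # the nested <a> element

# .text holds only the text before the first child element.
before_child = root.text

# Text after the child's closing tag is stored on the child as .tail.
after_child = link.tail

# itertext() yields text and tail values in document order.
all_text = "".join(root.itertext())

print(repr(before_child))
print(repr(after_child))
print(repr(all_text))
```

If you only need the concatenated text, itertext() or xpath("string()") is probably less work than walking text and tail by hand.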
    
