lxml.etree, element.text doesn't return the entire text from an element

梦毁少年i 2021-02-07 10:39

I scraped some HTML via XPath and then converted it into an etree. Something similar to this:

    <td> text1 <a> link </a> text2 </td>

8 answers
  • 2021-02-07 10:43

    Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
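
    A minimal sketch of how the two calls differ, assuming the `<td>` sits inside a hypothetical `<tr>` so that it has tail text of its own. `string()` stops at the element boundary, while `tostring()` appends the tail unless `with_tail=False` is passed, and returns bytes unless an encoding such as `"unicode"` is requested.

```python
from lxml import etree

# Hypothetical markup: the question's <td>, wrapped in a <tr> so it has a tail.
row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = row[0]

# XPath string(): text of the element and all of its descendants, no tail.
as_string = td.xpath("string()")

# tostring(method="text"): same text, plus the element's own tail by default.
with_tail = etree.tostring(td, method="text", encoding="unicode")
without_tail = etree.tostring(td, method="text", encoding="unicode",
                              with_tail=False)

print(repr(as_string))     # ' text1  link  text2 '
print(repr(with_tail))     # ' text1  link  text2  after '
print(repr(without_tail))  # ' text1  link  text2 '
```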

  • 2021-02-07 10:44

    As a public service to people out there who may be as lazy as I am. Here's some code from above that you can run.

    from lxml import etree

    def get_text1(node):
        # Direct text only: node.text plus the tails of its children.
        result = node.text or ""
        for child in node:
            if child.tail is not None:
                result += child.tail
        return result

    def get_text2(node):
        # All text, recursively. Note that this also includes node.tail.
        return ((node.text or '') +
                ''.join(map(get_text2, node)) +
                (node.tail or ''))

    def get_text3(node):
        # Text plus serialized child markup. encoding="unicode" makes
        # tostring() return str instead of bytes on Python 3.
        return (node.text or "") + "".join(
            [etree.tostring(child, encoding="unicode")
             for child in node.iterchildren()])


    root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

    print(root.xpath("text()"))
    print(get_text1(root))
    print(get_text2(root))
    print(root.xpath("string()"))
    print(etree.tostring(root, method="text", encoding="unicode"))
    print(etree.tostring(root, method="xml", encoding="unicode"))
    print(get_text3(root))
    

    Output is:

    snowy:rpg$ python test.py 
    [' text1 ', ' text2 ']
     text1  text2 
     text1  link  text2 
     text1  link  text2 
     text1  link  text2 
    <td> text1 <a> link </a> text2 </td>
     text1 <a> link </a> text2 
    
  • 2021-02-07 10:52
    def get_text_recursive(node):
        return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')
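
    One thing to watch with the recursive version: it also appends the tail of the node it is called on. A hedged comparison with the built-in `itertext()` method, which covers the same descendants but leaves out the element's own tail. The `<tr>` wrapper is a hypothetical addition so that the `<td>` has a tail, and the fallback import only relies on the part of the API that the standard library shares with lxml.

```python
try:
    from lxml import etree
except ImportError:
    # The standard library offers the same text/tail/itertext() API.
    import xml.etree.ElementTree as etree

def get_text_recursive(node):
    # Same function as in the answer: text, children (recursively), own tail.
    return ((node.text or '') +
            ''.join(map(get_text_recursive, node)) +
            (node.tail or ''))

# Hypothetical markup: the <td> is followed by tail text inside a <tr>.
row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = row[0]

recursive = get_text_recursive(td)
iterated = "".join(td.itertext())

print(repr(recursive))  # ' text1  link  text2  after '
print(repr(iterated))   # ' text1  link  text2 '
```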
    
  • 2021-02-07 10:56

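    This looks like an lxml bug at first, but according to the documentation it is by design: element.text holds only the text before the first child, and the text after each child is stored in that child's tail. I solved it like this (note that it collects only the element's direct text, not the text inside its children):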

    def node_text(node):
        if node.text:
            result = node.text
        else:
            result = ''
        for child in node:
            if child.tail is not None:
                result += child.tail
        return result
    
  • 2021-02-07 11:04
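    element.xpath('normalize-space()') also works. It returns the same text as string(), but with leading and trailing whitespace stripped and runs of whitespace collapsed to a single space.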
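
    A short sketch on the question's markup, comparing `normalize-space()` with `string()` (the variable names are only illustrative):

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

raw = td.xpath("string()")
normalized = td.xpath("normalize-space()")

print(repr(raw))         # ' text1  link  text2 '
print(repr(normalized))  # 'text1 link text2'
```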
    
  • 2021-02-07 11:06

    If element is the <td>, you can do the following:

    element.xpath('.//text()')
    

    This gives you a list of all text nodes inside the element. The dot is the current element, // descends through all of its descendants, and text() is the node test that selects text nodes.
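
    A minimal sketch of what that list looks like for the question's markup. In lxml the results are "smart" strings by default, so each piece can report whether it came from `.text` or `.tail` and which element it belongs to. Joining the pieces gives the full text.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

pieces = td.xpath(".//text()")
for piece in pieces:
    # is_text / is_tail say where the string was stored;
    # getparent() returns the element that owns it.
    kind = "text" if piece.is_text else "tail"
    print(repr(str(piece)), kind, "of", piece.getparent().tag)

joined = "".join(pieces)
print(repr(joined))  # ' text1  link  text2 '
```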
