Get the inner HTML of an element in lxml

粉色の甜心 2020-12-13 19:10

I am trying to get the HTML content of a child node with lxml and XPath in Python. As shown in the code below, I want to get the HTML content of each of the product nodes. Does i

6 answers
  • 2020-12-13 19:21

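    You can use product.text_content(), which is available on lxml.html elements. Note that it returns only the text of the element and its descendants, with the markup stripped.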

  • 2020-12-13 19:23

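    Another way to do this:

    # doc is an already-parsed lxml tree; this selects the parent of the div.name elements
    x = doc.xpath("//div[@class='name']/parent::*")
    # map() is lazy in Python 3, so build a list before printing
    print([etree.tostring(e, encoding='unicode') for e in x])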
    
  • 2020-12-13 19:27
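    from lxml import etree
    # root is an already-parsed element; encoding='unicode' returns str instead of bytes
    print(etree.tostring(root, pretty_print=True, encoding='unicode'))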
    

    You can find more examples in the lxml tutorial: http://lxml.de/tutorial.html

  • 2020-12-13 19:28

    A simple function to get the innerHTML or innerXML of an element.

    Try it out directly: https://pyfiddle.io/fiddle/631aa049-2785-4c58-bf82-eff4e2f8bedb/

    Function

    from lxml import etree
    from xml.sax.saxutils import escape

    def innerXML(elem):
        # Start with the text that precedes the first child, re-escaped for XML.
        resultStr = escape(elem.text or '')
        for e in elem:
            # tostring() serializes the child plus the text that follows it (its tail).
            resultStr = resultStr + etree.tostring(e, encoding='unicode')
        return resultStr

    Invocation

    XMLElem = etree.fromstring("<div>I am<name>Jhon <last.name> Corner</last.name></name>.I work as <job>software engineer</job><end meta='bio' />.</div>")
    print(innerXML(XMLElem))

    Logic behind it

    • Start with the element's own leading text
    • Then go through all the child elements
    • Convert each child to a string using tostring, which also includes the text that follows the child
    • Concatenate them
  • 2020-12-13 19:34

    I believe you want to use the tostring() method:

    from lxml import etree
    
    tree = etree.fromstring('<html><head><title>foo</title></head><body><div class="name"><p>foo</p></div><div class="name"><ul><li>bar</li></ul></div></body></html>')
    for elem in tree.xpath("//div[@class='name']"):
        # pretty_print ensures that it is nicely formatted;
        # encoding='unicode' returns str rather than bytes on Python 3.
        print(etree.tostring(elem, pretty_print=True, encoding='unicode'))
    
  • 2020-12-13 19:37

    After right-clicking the specific field you want in Chrome's inspector and choosing Copy, then Copy XPath, you might get something like this:

    //*[@id="specialID"]/div[12]/div[2]/h4/text()[1]
    

    If you want that text node for every div under "specialID" rather than only the twelfth, drop the index:

    //*[@id="specialID"]/div/div[2]/h4/text()[1]
    

    You can select a second field with the union operator (|); the results of both paths come back interleaved in document order:

    //*[@id="specialID"]/div/div[2]/h4/text()[1] | //*[@id="specialID"]/div/some/weird/path[95]
    

    The example could be improved, but it illustrates the point:

    //*[@id="mw-content-text"]/div/ul[1]/li[11]/text()
    

    from lxml import html
    import requests
    page = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
    tree = html.fromstring(page.content)
    data = tree.xpath('//*[@id="mw-content-text"]/div/ul[1]/li/a/text() | //*[@id="mw-content-text"]/div/ul[1]/li/text()[1]')
    print(len(data))
    for item in data:
        print(item)
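Because the Wikipedia page can change at any time, here is a small offline sketch of the same union technique. The document and its element names are invented for illustration; only the id "specialID" is taken from the paths above.

```python
from lxml import html

# Hypothetical document: two rows, each with a heading and a span.
doc = html.fromstring(
    '<html><body><div id="specialID">'
    '<div><h4>first</h4><span>a</span></div>'
    '<div><h4>second</h4><span>b</span></div>'
    '</div></body></html>'
)

# The union operator merges both node-sets; lxml returns them in document order.
data = doc.xpath(
    '//*[@id="specialID"]/div/h4/text() | //*[@id="specialID"]/div/span/text()'
)
for item in data:
    print(item)
```

The two fields alternate row by row because each h4 precedes its sibling span in the document, not because of the order in which the paths are written.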
    