Parse XML with lxml - extract element value

有刺的猬 2020-12-17 19:25

Let's suppose we have an XML file with the following structure.

 


        
3 Answers
  • 2020-12-17 20:04

    Try the following working code:

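    from urllib.request import urlopen  # Python 3 replacement for urllib2
    from lxml import etree

    url = "https://dl.dropbox.com/u/540963/short_test.xml"
    # Fetch over HTTPS ourselves and hand the open response to lxml.
    with urlopen(url) as fp:
        doc = etree.parse(fp)

    for record in doc.xpath('//datafield'):
        # Print the tag attribute, then each subfield's text, indented.
        print(record.xpath("./@tag")[0])
        for x in record.xpath("./subfield/text()"):
            print("\t", x)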
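Since the file's structure is not shown in the question, here is a minimal offline sketch of the same XPath approach. The sample document, its attribute values, and the helper name `extract_fields` are assumptions made up for illustration; only the `datafield`/`subfield` element names and the `tag` attribute come from the code above.

```python
from lxml import etree

# Hypothetical sample document: the real file's contents are not shown.
SAMPLE_XML = b"""<record>
  <datafield tag="100" ind1="1">
    <subfield code="a">Doe, John</subfield>
    <subfield code="d">1970-</subfield>
  </datafield>
  <datafield tag="245" ind1="1">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def extract_fields(root):
    """Return a list of (tag attribute, [subfield texts]) pairs."""
    fields = []
    for record in root.xpath("//datafield"):
        tag = record.get("tag")
        # text() yields lxml's str subclass; convert to plain str.
        values = [str(v) for v in record.xpath("./subfield/text()")]
        fields.append((tag, values))
    return fields


root = etree.fromstring(SAMPLE_XML)
fields = extract_fields(root)
for tag, values in fields:
    print(tag)
    for value in values:
        print("\t" + value)
```

Collecting the values into a list before printing keeps the extraction separate from the output, which makes it easier to check against a known document.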
    
  • 2020-12-17 20:09

    I would just go with

    for df in doc.xpath('//datafield'):
        print(df.attrib)
        for sf in df:  # iterating an element yields its children
            print(sf.text)
    

    Also, you don't need urllib: etree.parse accepts an HTTP URL directly.

    url = "http://dl.dropbox.com/u/540963/short_test.xml"  # doesn't work with https, though
    doc = etree.parse(url)
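If the file is only reachable over HTTPS, one possible workaround is to download the bytes with the standard library and parse them from memory. This is a sketch, not the answerer's method: the helper names `parse_xml_bytes` and `parse_xml_url` and the inline payload are hypothetical, and the demonstration at the bottom deliberately avoids any network call.

```python
import io
import urllib.request

from lxml import etree


def parse_xml_bytes(data):
    """Parse an XML document already held in memory as bytes."""
    return etree.parse(io.BytesIO(data))


def parse_xml_url(url, timeout=10):
    """Fetch url with urllib (which handles HTTPS) and parse the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_xml_bytes(response.read())


# Offline demonstration with a made-up payload instead of a download.
doc = parse_xml_bytes(b'<record><datafield tag="245"/></record>')
tags = doc.xpath("//datafield/@tag")
print(tags)
```

With network access, `parse_xml_url("https://...")` would return the same kind of tree object that `etree.parse(url)` gives for plain HTTP.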
    
  • 2020-12-17 20:24

    I would be more direct in your XPath: go straight for the elements you want, in this case datafield.

    for df in doc.xpath('//datafield'):
        # Iterate over the attributes of datafield
        for attrib_name in df.attrib:
            print('@' + attrib_name + '=' + df.attrib[attrib_name])

        # subfield elements are children of datafield; iterate over them
        for subfield in df:
            # .text is None for an empty element, so fall back to ''
            print('subfield=' + (subfield.text or ''))
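The same walk can be written with `attrib.items()` and `iterchildren()`, both part of lxml's element API. The sketch below runs against a made-up document (the attribute values and the empty `subfield` are assumptions) and shows why guarding against a `None` text matters; `describe` is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical sample; the second subfield is empty on purpose.
SAMPLE_XML = b"""<record>
  <datafield tag="245" ind1="1">
    <subfield code="a">An example title</subfield>
    <subfield code="b"/>
  </datafield>
</record>"""


def describe(root):
    """Return one line per datafield attribute and per subfield child."""
    lines = []
    for df in root.iter("datafield"):
        for name, value in df.attrib.items():
            lines.append("@{}={}".format(name, value))
        # iterchildren("subfield") skips comments and other child tags.
        for subfield in df.iterchildren("subfield"):
            # An empty element has text None; use '' instead.
            lines.append("subfield={}".format(subfield.text or ""))
    return lines


lines = describe(etree.fromstring(SAMPLE_XML))
for line in lines:
    print(line)
```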
    

    Also, lxml appears to let you ignore the namespace, maybe because your example only uses one namespace?
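Whether the original file declares a namespace cannot be checked from what is shown here. For reference, in XPath 1.0 an unprefixed name only matches elements in no namespace, so with a default namespace declared, `//datafield` finds nothing and a prefix mapping is needed. The sketch below uses a made-up namespace URI and an arbitrary prefix `m`; both are assumptions.

```python
from lxml import etree

# Hypothetical document with a default namespace (the URI is made up).
NS_XML = b"""<record xmlns="http://example.com/ns">
  <datafield tag="245">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""

root = etree.fromstring(NS_XML)

# Unprefixed names match only elements that are in no namespace.
unqualified = root.xpath("//datafield")

# Bind a prefix of your choosing to the namespace URI for the query.
ns = {"m": "http://example.com/ns"}
qualified = root.xpath("//m:datafield", namespaces=ns)
titles = root.xpath("//m:datafield/m:subfield/text()", namespaces=ns)

# Alternative that ignores the namespace by comparing local names.
by_local_name = root.xpath("//*[local-name()='datafield']")

print(len(unqualified), len(qualified), titles, len(by_local_name))
```

If `//datafield` does match in your file, the most likely explanation is that the elements are simply not in a namespace.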
