Parse xml with lxml - extract element value

前端未结

关注

 3  588

Let\'s suppose we have the XML file with the structure as follows.


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  走了就别回头了        
                
              
                            
                2020-12-17 20:04
              
            
            
                                                                       
Try the following working code :

import urllib2
from lxml import etree

url = "https://dl.dropbox.com/u/540963/short_test.xml"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()

for record in doc.xpath('//datafield'):
    print record.xpath("./@tag")[0]
    for x in record.xpath("./subfield/text()"):
        print "\t", x

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  情歌与酒        
                
              
                            
                2020-12-17 20:09
              
            
            
                                                                       
I would just go with

for df in doc.xpath('//datafield'):
    print df.attrib
    for sf in df.getchildren():
        print sf.text


Also you don't need urllib, you can directly parse XML with HTTP

url = "http://dl.dropbox.com/u/540963/short_test.xml"  #doesn't work with https though
doc = etree.parse(url)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  野趣味        
                
              
                            
                2020-12-17 20:24
              
            
            
                                                                       
I would be more direct in your XPath: go straight for the elements you want, in this case datafield.

>>> for df in doc.xpath('//datafield'):
        # Iterate over attributes of datafield
        for attrib_name in df.attrib:
                print '@' + attrib_name + '=' + df.attrib[attrib_name]

        # subfield is a child of datafield, and iterate
        subfields = df.getchildren()
        for subfield in subfields:
                print 'subfield=' + subfield.text


Also, lxml appears to let you ignore the namespace, maybe because your example only uses one namespace?
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复