How can I access namespaced XML elements using BeautifulSoup?

清酒与你 2020-12-07 01:30

I have an XML document which reads like this:



    <xml>
      <web:Web>
        <web:Total>4000</web:Total>
        <web:Offset>0</web:Offset>
      </web:Web>
    </xml>


        
3 answers
  • 2020-12-07 01:42

    BeautifulSoup isn't a DOM library per se (it doesn't implement the DOM APIs). To make matters more complicated, you're using namespaces in that XML fragment. To parse that specific piece of XML, you'd use BeautifulSoup as follows:

    from BeautifulSoup import BeautifulSoup
    
    xml = """<xml>
      <web:Web>
        <web:Total>4000</web:Total>
        <web:Offset>0</web:Offset>
      </web:Web>
    </xml>"""
    
    doc = BeautifulSoup( xml )
    print doc.find( 'web:total' ).string
    print doc.find( 'web:offset' ).string
    

    If you weren't using namespaces, the code could look like this:

    from BeautifulSoup import BeautifulSoup
    
    xml = """<xml>
      <Web>
        <Total>4000</Total>
        <Offset>0</Offset>
      </Web>
    </xml>"""
    
    doc = BeautifulSoup( xml )
    print doc.xml.web.total.string
    print doc.xml.web.offset.string
    

    The key here is that BeautifulSoup doesn't know (or care) anything about namespaces. Thus web:Web is treated as a tag literally named web:web rather than as a Web tag belonging to the web namespace. While BeautifulSoup adds web:web to the xml element dictionary, Python syntax doesn't recognize web:web as a single identifier, so the tag can't be reached with dotted attribute access and has to be looked up with find() instead.

    You can learn more about it by reading the documentation.
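    For readers on Python 3, the same lookup can be sketched with the bs4 package (BeautifulSoup 4). This is only a sketch, and it assumes the built-in html.parser backend, which lowercases tag names and keeps the web: prefix as part of the name, much as described above.

```python
from bs4 import BeautifulSoup

xml = """<xml>
  <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
  </web:Web>
</xml>"""

# Assumption: html.parser is the backend, so tag names are lowercased and
# "web:total" is just an ordinary tag name that happens to contain a colon.
doc = BeautifulSoup(xml, "html.parser")

# Dotted access cannot spell a name with a colon in it, so use find().
total = doc.find("web:total").string
offset = doc.find("web:offset").string

print(total)
print(offset)
```

    Because the prefix is treated as plain text here, nothing checks that web: refers to a declared namespace.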

  • 2020-12-07 01:44

    You should explicitly declare your namespace on the root element, using the xmlns:prefix="URI" syntax (see examples here), and then access the element via prefix:tag from BeautifulSoup. Keep in mind that you should also explicitly tell BeautifulSoup how to process your document, in this case:

    xml = BeautifulSoup(xml_content, 'xml')

  • 2020-12-07 01:56

    This is an old question, but somebody might not know that at least BeautifulSoup 4 does handle namespaces well if you pass 'xml' as the second argument to the constructor:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("""<xml>
    <web:Web>
    <web:Total>4000</web:Total>
    <web:Offset>0</web:Offset>
    </web:Web>
    </xml>""", 'xml')

    print(soup.prettify())

    This prints:

    <?xml version="1.0" encoding="utf-8"?>
    <xml>
     <Web>
      <Total>
       4000
      </Total>
      <Offset>
       0
      </Offset>
     </Web>
    </xml>
    