I have this XML file:
virtual bug
66523dfdf555dfd
XML
<data>
<items>
<item name="item1">item1</item>
<item name="item2">item2</item>
<item name="item3">item3</item>
<item name="item4">item4</item>
</items>
</data>
Python :
from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print "Len : ", len(itemlist)
print "Attribute Name : ", itemlist[0].attributes['name'].value
print "Text : ", itemlist[0].firstChild.nodeValue
for s in itemlist :
print "Attribute Name : ", s.attributes['name'].value
print "Text : ", s.firstChild.nodeValue
Here's an lxml snippet that extracts an attribute as well as element text (your question was a little ambiguous about which one you needed, so I'm including both):
from lxml import etree
doc = etree.parse(filename)
memoryElem = doc.find('memory')
print memoryElem.text # element text
print memoryElem.get('unit') # attribute
You asked (in a comment on Ali Afshar's answer) whether minidom (2.x, 3.x) is a good alternative. Here's the equivalent code using minidom; judge for yourself which is nicer:
import xml.dom.minidom as minidom
doc = minidom.parse(filename)
memoryElem = doc.getElementsByTagName('memory')[0]
print ''.join( [node.data for node in memoryElem.childNodes] )
print memoryElem.getAttribute('unit')
lxml seems like the winner to me.
You can try parsing it with using (recover=True). you can do something like this.
parser = etree.XMLParser(recover=True)
tree = etree.parse('your xml file', parser)
I used this recently and it worked for me, you can try and see but in case you need to do any more complecated xml data extractions, you can take a look at this code i wrote for some project handling complex xml data extractions.
Above XML does not have closing tag, It will give
etree parse error: Premature end of data in tag
Correct XML is:
<domain type='kmc' id='007'>
<name>virtual bug</name>
<uuid>66523dfdf555dfd</uuid>
<os>
<type arch='xintel' machine='ubuntu'>hvm</type>
<boot dev='hd'/>
<boot dev='cdrom'/>
</os>
<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>270336</currentMemory>
<vcpu placement='static'>10</vcpu>
</domain>
Other people can tell you how to do it with the Python standard library. I'd recommend my own mini-library that makes this a completely straight forward.
>>> obj = xml2obj.xml2obj("""<domain type='kmc' id='007'>
... <name>virtual bug</name>
... <uuid>66523dfdf555dfd</uuid>
... <os>
... <type arch='xintel' machine='ubuntu'>hvm</type>
... <boot dev='hd'/>
... <boot dev='cdrom'/>
... </os>
... <memory unit='KiB'>524288</memory>
... <currentMemory unit='KiB'>270336</currentMemory>
... <vcpu placement='static'>10</vcpu>
... </domain>""")
>>> obj.uuid
u'66523dfdf555dfd'
http://code.activestate.com/recipes/534109-xml-to-python-data-structure/
I would use lxml and parse it out using xpath //UUID