How do I parse XML in Python?

Anonymous (unverified), submitted 2019-12-03 02:08:02

Question:

I have many rows in a database that contain XML, and I'm trying to write a Python script that goes through those rows and counts how many instances of a particular node attribute show up. For instance, my tree looks like:

    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>

How can I access the attribute values 1 and 2 in the XML using Python?

Answer 1:

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml, and cElementTree in the Python standard library itself; in this context, what they chiefly add is even more speed. The ease of programming depends on the API, which ElementTree defines.

After building an Element instance e from the XML, e.g. with the XML function, or by parsing a file with something like

    import xml.etree.ElementTree
    e = xml.etree.ElementTree.parse('thefile.xml').getroot()

or any of the many other ways shown at ElementTree, you just do something like:

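    # findall('type') would only match direct children of the root,
    # so give the path down to the nested <type> elements.
    for atype in e.findall('bar/type'):
        print(atype.get('foobar'))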

and similar, usually pretty simple, code patterns.
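To tie this back to the counting part of the question, here is a minimal sketch that uses only the standard library. The helper name `count_attribute` and the `XML_TEXT` constant are made up for illustration; the sketch uses `Element.iter()` instead of `findall()` so that `<type>` elements are found at any depth.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document from the question.
XML_TEXT = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def count_attribute(xml_text, tag, attribute):
    """Count each value of `attribute` found on `tag` elements (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so nesting depth does not matter.
    return Counter(
        elem.get(attribute)
        for elem in root.iter(tag)
        if elem.get(attribute) is not None
    )


counts = count_attribute(XML_TEXT, "type", "foobar")
print(dict(counts))
```

When processing many database rows, the per-row `Counter` objects can be merged with `update()` or `+` to get overall totals.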



Answer 2:

minidom is the quickest and pretty straightforward:

XML:

    <data>
        <items>
            <item name="item1"></item>
            <item name="item2"></item>
            <item name="item3"></item>
            <item name="item4"></item>
        </items>
    </data>

PYTHON:

    from xml.dom import minidom

    xmldoc = minidom.parse('items.xml')
    itemlist = xmldoc.getElementsByTagName('item')
    print(len(itemlist))
    print(itemlist[0].attributes['name'].value)
    for s in itemlist:
        print(s.attributes['name'].value)

OUTPUT:

    4
    item1
    item1
    item2
    item3
    item4


Answer 3:

You can use BeautifulSoup:

    >>> from bs4 import BeautifulSoup
    >>> x = """<foo>
    ...    <bar>
    ...       <type foobar="1"/>
    ...       <type foobar="2"/>
    ...    </bar>
    ... </foo>"""
    >>> y = BeautifulSoup(x)
    >>> y.foo.bar.type["foobar"]
    u'1'
    >>> y.foo.bar.findAll("type")
    [<type foobar="1"></type>, <type foobar="2"></type>]
    >>> y.foo.bar.findAll("type")[0]["foobar"]
    u'1'
    >>> y.foo.bar.findAll("type")[1]["foobar"]
    u'2'


Answer 4:

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

    library                         time      space
    xml.dom.minidom (Python 2.1)    6.3 s     80000k
    gnosis.objectify                2.0 s     22000k
    xml.dom.minidom (Python 2.4)    1.4 s     53000k
    ElementTree 1.2                 1.6 s     14500k
    ElementTree 1.2.4/1.3           1.1 s     14500k
    cDomlette (C extension)         0.540 s   20500k
    PyRXPU (C extension)            0.175 s   10850k
    libxml2 (C extension)           0.098 s   16000k
    readlines (read as utf-8)       0.093 s   8850k
    cElementTree (C extension)  --> 0.047 s   4900k <--
    readlines (read as ascii)       0.032 s   5050k

As pointed out by @jfs, cElementTree comes bundled with Python:

  • Python 2: from xml.etree import cElementTree as ElementTree.
  • Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).


Answer 5:

lxml.objectify is really simple.

Taking your sample text:

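    from collections import defaultdict
    from lxml import objectify

    # The sample XML from the question
    text = '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'

    count = defaultdict(int)
    root = objectify.fromstring(text)

    for item in root.bar.type:
        count[item.attrib.get("foobar")] += 1

    print(dict(count))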

Output:

{'1': 1, '2': 1} 


Answer 6:

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict:

    >>> e = """<foo>
    ...     <bar>
    ...         <type foobar="1"/>
    ...         <type foobar="2"/>
    ...     </bar>
    ... </foo>"""
    >>> import xmltodict
    >>> result = xmltodict.parse(e)
    >>> result
    OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])
    >>> result['foo']
    OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])
    >>> result['foo']['bar']
    OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])


Answer 7:

Python has an interface to the expat XML parser:

xml.parsers.expat 

It's a non-validating parser, so the document is not checked against a DTD or schema; only XML that is not well-formed raises an error. But if you know your file is correct, this is pretty good: you will probably get the exact info you want, and you can discard the rest on the fly.

    from xml.parsers import expat

    stringofxml = """<foo>
        <bar>
            <type arg="value" />
            <type arg="value" />
            <type arg="value" />
        </bar>
        <bar>
            <type arg="value" />
        </bar>
    </foo>"""

    count = 0

    # Called by the parser for every opening tag
    def start(name, attr):
        global count
        if name == 'type':
            count += 1

    p = expat.ParserCreate()
    p.StartElementHandler = start
    p.Parse(stringofxml)

    print(count)  # prints 4
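The same handler approach can collect attribute values instead of only counting elements. The following is a rough sketch applied to the XML from the question; the function name `collect_attribute_values` is hypothetical, and a closure replaces the global counter. It also shows that expat, although non-validating, raises `ExpatError` when the input is not well-formed.

```python
from xml.parsers import expat

# Sample document from the question.
XML_TEXT = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def collect_attribute_values(xml_text, tag, attribute):
    """Return the values of `attribute` on each `tag` element (hypothetical helper)."""
    values = []

    def start_element(name, attrs):
        # By default expat passes the attributes as a dict.
        if name == tag and attribute in attrs:
            values.append(attrs[attribute])

    parser = expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return values


values = collect_attribute_values(XML_TEXT, "type", "foobar")
print(values)

# Mismatched tags are rejected even though no DTD validation happens.
try:
    collect_attribute_values("<foo><bar></foo>", "type", "foobar")
    malformed_rejected = False
except expat.ExpatError as exc:
    malformed_rejected = True
    print("Malformed XML:", exc)
```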


Answer 8:

Here is some very simple but effective code using cElementTree.

    import sys

    try:
        import cElementTree as ET  # standalone package on very old Pythons
    except ImportError:
        try:
            # Python 2.5 through 3.8 ship it in the standard library
            import xml.etree.cElementTree as ET
        except ImportError:
            # Python 3.9+ removed cElementTree; ElementTree is accelerated automatically
            import xml.etree.ElementTree as ET


    def exit_err(message):
        """Print an error message and exit with a non-zero status."""
        sys.exit(message)


    def find_in_tree(tree, node):
        found = tree.find(node)
        if found is None:
            print("No %s in file" % node)
            found = []
        return found


    # Parse an XML file (specify the path)
    def_file = "xml_file_name.xml"
    try:
        with open(def_file, "rb") as f:
            root = ET.parse(f).getroot()
    except Exception:
        exit_err("Unable to open and parse input definition file: " + def_file)

    # Find node 'myNode'; iterating over the result yields its child nodes
    fwdefs = find_in_tree(root, "myNode")

Source:

http://www.snip2code.com/Snippet/991/python-xml-parse?fromPage=1



Answer 9:

Just to add another possibility, you can use untangle, a simple XML-to-Python-object library. Here is an example:

Installation

pip install untangle 

Usage

Your xml file (a little bit changed):

    <foo>
       <bar name="bar_name">
          <type foobar="1"/>
       </bar>
    </foo>

accessing the attributes with untangle:

    import untangle

    obj = untangle.parse('/path_to_xml_file/file.xml')

    print(obj.foo.bar['name'])
    print(obj.foo.bar.type['foobar'])

The output will be:

    bar_name
    1

More information about untangle can be found here.
Also (if you are curious), you can find a list of tools for working with XML and Python here (you will also see that the most common ones were mentioned by previous answers).



Answer 10:

I find the Python xml.dom and xml.dom.minidom quite easy. Keep in mind that DOM isn't good for large amounts of XML, but if your input is fairly small then this will work fine.
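As a small illustration, this is roughly how minidom could be applied to the XML from the question. It is only a sketch: the `XML_TEXT` constant and variable names are chosen for the example, and `parseString` builds the whole document in memory, which is why this approach suits small inputs.

```python
from xml.dom import minidom

# Sample document from the question.
XML_TEXT = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

# parseString builds the complete DOM tree in memory.
doc = minidom.parseString(XML_TEXT)

# getElementsByTagName searches the whole document, at any depth.
foobar_values = [
    node.getAttribute("foobar")
    for node in doc.getElementsByTagName("type")
    if node.hasAttribute("foobar")
]
print(foobar_values)

# Release the DOM tree once it is no longer needed.
doc.unlink()
```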



Answer 11:

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

    import declxml as xml

    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.dictionary('bar', [
            xml.array(xml.integer('type', attribute='foobar'))
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}} 

You can also use the same processor to serialize data to XML:

    data = {'bar': {
        'foobar': [7, 3, 21, 16, 11]
    }}

    xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output:

    <?xml version="1.0" ?>
    <foo>
        <bar>
            <type foobar="7"/>
            <type foobar="3"/>
            <type foobar="21"/>
            <type foobar="16"/>
            <type foobar="11"/>
        </bar>
    </foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

    import declxml as xml

    class Bar:

        def __init__(self):
            self.foobars = []

        def __repr__(self):
            return 'Bar(foobars={})'.format(self.foobars)


    xml_string = """
    <foo>
       <bar>
          <type foobar="1"/>
          <type foobar="2"/>
       </bar>
    </foo>
    """

    processor = xml.dictionary('foo', [
        xml.user_object('bar', Bar, [
            xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
        ])
    ])

    xml.parse_from_string(processor, xml_string)

Which produces the following output:

{'bar': Bar(foobars=[1, 2])} 


Answer 12:

    import xml.etree.ElementTree as ET

    data = '''<foo>
               <bar>
                   <type foobar="1"/>
                   <type foobar="2"/>
               </bar>
           </foo>'''
    tree = ET.fromstring(data)
    lst = tree.findall('bar/type')
    for item in lst:
        print(item.get('foobar'))

This prints the value of each foobar attribute.


