Is there an elegant way to count tag elements in a xml file using lxml in python?

问题

I could read the content of the xml file to a string and use string operations to achieve this, but I guess there is a more elegant way to do this. Since I did not find a clue in the docus, I am sking here:

Given an xml (see below) file, how do you count xml tags, like count of author-tags in the example bewlow the most elegant way? We assume, that each author appears exactly once.

<root>
    <author>Tim</author>
    <author>Eva</author>
    <author>Martin</author>
    etc.
</root>

This xml file is trivial, but it is possible, that the authors are not always listed one after another, there may be other tags between them.

回答1:

If you want to count all author tags:

import lxml.etree
doc = lxml.etree.parse(xml)
count = doc.xpath('count(//author)')

回答2:

Use an XPath with count.

回答3:

One must be careful using module re to treat a SGML/XML/HTML text, because not all treatments of such files can't be performed with regex (regexes aren't able to parse a SGML/HTML/XML text)

But here, in this particular problem, it seems to me it is possible (re.DOTALL is mandatory because an element may extend on more than one line; apart that, I can't imagine any other possible pitfall)

from time import clock
n= 10000
print 'n ==',n,'\n'



import lxml.etree
doc = lxml.etree.parse('xml.txt')

te = clock()
for i in xrange(n):
    countlxml = doc.xpath('count(//author)')
tf = clock()
print 'lxml\ncount:',countlxml,'\n',tf-te,'seconds'



import re
with open('xml.txt') as f:
    ch = f.read()

regx = re.compile('<author>.*?</author>',re.DOTALL)
te = clock()
for i in xrange(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = clock()
print '\nre\ncount:',countre,'\n',tf-te,'seconds'

result

n == 10000 

lxml
count: 3.0 
2.84083032899 seconds

re
count: 3 
0.141663256084 seconds

来源：https://stackoverflow.com/questions/6483851/is-there-an-elegant-way-to-count-tag-elements-in-a-xml-file-using-lxml-in-python

标签

python

xml