I could read the content of the xml file to a string and use string operations to achieve this, but I guess there is a more elegant way to do this. Since I did not find a clue i
One must be careful when using the re module on SGML/XML/HTML text, because not every operation on such files can be performed with regexes (regexes cannot parse SGML/HTML/XML text in general).
But in this particular problem it seems to me that it is possible (re.DOTALL is mandatory because an element may extend over more than one line; apart from that, I can't imagine any other pitfall).
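To illustrate the re.DOTALL point, here is a minimal sketch. The sample document and the pattern `<author>.*?</author>` are assumptions made for illustration (the tag name is borrowed from the `//author` XPath used in the benchmark); they are not taken from the asker's actual file.

```python
import re

# Hypothetical sample: the second <author> element spans two lines.
sample = """<book>
<author>Jane Doe</author>
<author>John
Smith</author>
</book>"""

# Assumed pattern: non-greedy match from an opening to a closing author tag.
pattern = r'<author>.*?</author>'

def count_authors(text, flags=0):
    """Count non-overlapping matches of the assumed author pattern."""
    return sum(1 for _ in re.finditer(pattern, text, flags))

# Without DOTALL, '.' does not match a newline, so the two-line element is missed.
without_dotall = count_authors(sample)
# With DOTALL, '.' also matches newlines, so both elements are counted.
with_dotall = count_authors(sample, re.DOTALL)
print(without_dotall, with_dotall)
```

With this sample, the count without the flag is 1 and the count with the flag is 2, which is why the flag cannot be dropped when elements may contain line breaks.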
from time import clock

n = 10000
print 'n ==',n,'\n'

# Count the author elements n times with lxml's XPath and time the loop.
import lxml.etree
doc = lxml.etree.parse('xml.txt')
te = clock()
for i in xrange(n):
    countlxml = doc.xpath('count(//author)')
tf = clock()
print 'lxml\ncount:',countlxml,'\n',tf-te,'seconds'

# Read the same file into a string for the regex approach.
import re
with open('xml.txt') as f:
    ch = f.read()
regx = re.compile('<author>.*?</author>',re.DOTALL)
# Count the regex matches n times and time the loop.
te = clock()
for i in xrange(n):
    countre = sum(1 for mat in regx.finditer(ch))
tf = clock()
print '\nre\ncount:',countre,'\n',tf-te,'seconds'
Result:

n == 10000

lxml
count: 3.0
2.84083032899 seconds

re
count: 3
0.141663256084 seconds
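The benchmark above is Python 2 code (print statements, xrange, and time.clock, which was removed in Python 3.8). A rough Python 3 sketch of the same comparison follows. Several things here are assumptions rather than the original setup: the in-memory sample stands in for 'xml.txt', the pattern `<author>.*?</author>` is assumed, and the standard library's xml.etree.ElementTree is swapped in for lxml so the snippet runs without extra packages. Timings will differ from the figures above and depend on the machine and the document.

```python
import re
import timeit
import xml.etree.ElementTree as ET

# Hypothetical in-memory document standing in for 'xml.txt'.
xml_text = """<library>
<book><author>A</author></book>
<book><author>B
C</author></book>
<book><author>D</author></book>
</library>"""

root = ET.fromstring(xml_text)
# Assumed pattern; DOTALL lets '.' cross the line break inside the second element.
regx = re.compile(r'<author>.*?</author>', re.DOTALL)

def count_with_parser():
    # Walk the parsed tree and count every author element.
    return sum(1 for _ in root.iter('author'))

def count_with_regex():
    # Count non-overlapping regex matches in the raw text.
    return sum(1 for _ in regx.finditer(xml_text))

n = 1000
parser_seconds = timeit.timeit(count_with_parser, number=n)
regex_seconds = timeit.timeit(count_with_regex, number=n)
print('parser count:', count_with_parser(), parser_seconds, 'seconds')
print('regex count:', count_with_regex(), regex_seconds, 'seconds')
```

Checking that both functions return the same count before comparing timings is a cheap guard against a pattern that silently misses elements.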