Is there an elegant way to count tag elements in an XML file using lxml in Python?

暗喜 2021-02-13 01:35

I could read the contents of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way to do it. Since I did not find a clue i

3 answers
  • 2021-02-13 02:20

    Use an XPath with count.
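
    As a minimal sketch of what that can look like: the sample document, the `book` element and the `year` attribute below are assumptions made up for illustration, since the question does not show its XML. The point is that `count()` accepts any XPath node-set expression, so a path with a predicate narrows the count.

```python
from lxml import etree

# Hypothetical sample document; the question does not show its XML.
SAMPLE = b"""<library>
  <book year="2020"><author>A</author><author>B</author></book>
  <book year="2021"><author>C</author></book>
</library>"""

root = etree.fromstring(SAMPLE)

# XPath count() yields a float, so convert it when an integer is wanted.
total = int(root.xpath("count(//author)"))

# A predicate restricts the count to authors of books from one year.
recent = int(root.xpath('count(//book[@year="2021"]/author)'))

print(total, recent)
```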

  • 2021-02-13 02:21

    One must be careful when using the re module on SGML/XML/HTML text, because not every operation on such files can be performed with regular expressions (regexes cannot parse SGML/HTML/XML in general).

    But in this particular problem it seems to me that it is possible (re.DOTALL is mandatory because an element may span more than one line; apart from that, I can't imagine any other possible pitfall).

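    from time import perf_counter  # time.clock was removed in Python 3.8
    n = 10000
    print('n ==', n, '\n')

    # XPath approach: parse the file once, then time n evaluations of count()
    import lxml.etree
    doc = lxml.etree.parse('xml.txt')

    te = perf_counter()
    for i in range(n):
        countlxml = doc.xpath('count(//author)')  # XPath count() returns a float
    tf = perf_counter()
    print('lxml\ncount:', countlxml, '\n', tf - te, 'seconds')

    # Regex approach: read the file once, then time n scans of the text
    import re
    with open('xml.txt') as f:
        ch = f.read()

    regx = re.compile('<author>.*?</author>', re.DOTALL)
    te = perf_counter()
    for i in range(n):
        countre = sum(1 for mat in regx.finditer(ch))
    tf = perf_counter()
    print('\nre\ncount:', countre, '\n', tf - te, 'seconds')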

    Result:

    n == 10000 
    
    lxml
    count: 3.0 
    2.84083032899 seconds
    
    re
    count: 3 
    0.141663256084 seconds
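
    One caveat worth checking before relying on the speed difference: the pattern only matches the literal text `<author>`. The sketch below uses a made-up sample (an assumption, not the file from the benchmark) and the standard-library `xml.etree.ElementTree` parser as a reference count, to show how an attribute, a self-closing tag and a commented-out element can make the two counts disagree.

```python
import re
import xml.etree.ElementTree as ET  # stdlib parser, used here only as a reference

# Hypothetical sample document with three real <author> elements
# and one that is commented out.
SAMPLE = """<library>
  <author>A</author>
  <author id="2">B</author>
  <author/>
  <!-- <author>commented out</author> -->
</library>"""

# Same pattern as in the benchmark above.
regx = re.compile(r"<author>.*?</author>", re.DOTALL)
regex_count = sum(1 for _ in regx.finditer(SAMPLE))

# The parser sees the attribute-bearing and self-closing elements,
# and ignores the text inside the comment.
parser_count = sum(1 for _ in ET.fromstring(SAMPLE).iter("author"))

print(regex_count, parser_count)
```

    Here the regex misses two real elements and counts the commented-out one, so it is only safe when the file is known to contain plain `<author>...</author>` pairs.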
    
  • 2021-02-13 02:23

    If you want to count all author tags:

    import lxml.etree
    doc = lxml.etree.parse(xml)  # xml: path to the file, or an open file object
    count = doc.xpath('count(//author)')  # XPath count() returns a float, e.g. 3.0
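
    If an integer is preferred, or XPath is not needed, the element API can do the counting as well. This is a sketch under assumptions: the sample document is invented, and the fallback import to the standard-library `xml.etree.ElementTree` is only there so the snippet also runs where lxml is not installed (both expose `iter()` and `findall()`).

```python
try:
    from lxml import etree  # the library the question asks about
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback with a compatible API

# Hypothetical sample document, not taken from the question.
SAMPLE = (
    b"<library>"
    b"<book><author>A</author><author>B</author></book>"
    b"<book><author>C</author></book>"
    b"</library>"
)

root = etree.fromstring(SAMPLE)

# iter() walks the tree lazily and yields every element with this tag.
count_iter = sum(1 for _ in root.iter("author"))

# findall() with a path expression builds a list of matching descendants.
count_findall = len(root.findall(".//author"))

print(count_iter, count_findall)
```

    Both variants return plain integers; `iter()` avoids building an intermediate list, which may matter for large documents.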
    