Regular Expressions to parse template tags in XML

慢半拍i 2021-01-22 04:50

I need to parse some XML to pull out embedded template tags for further parsing. I can't seem to bend Python's regular expressions to do what I want, though.

In Engli

2 Answers
  • 2021-01-22 05:23

    Never ever parse HTML or XML or SGML with regular expressions.

    Always use tools like lxml, libxml2 or Beautiful Soup; they will always do a smarter and better job than your code.

  • 2021-01-22 05:31

    Please don't use regular expressions for this problem.

    I'm serious: parsing XML with a regex is hard, and it makes your code much less maintainable for anyone else.

    lxml is the de facto tool that Pythonistas use to parse XML. Take a look at this article on Stack Overflow for sample usage, or consider this answer, which should have been the accepted one.

    I hacked this up as a quick demo. It searches for <w:tc> elements that have a <w:t> child with text, and prints "Good" followed by the tag and text of each child of such an element.

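    import lxml.etree as ET
    from lxml.etree import XMLParser

    def worthy(elem):
        """Return True if elem has a direct <t> child that carries text."""
        for child in elem.iterchildren():
            # If the file declares the w: namespace, lxml reports tags as
            # '{namespace-uri}t', and this comparison must be adjusted.
            if child.tag == 't' and child.text is not None:
                return True
        return False

    def dump(elem):
        """Print the tag and text of every direct child of elem."""
        for child in elem.iterchildren():
            print("Good", child.tag, child.text)

    # recover=True lets lxml keep going on malformed input
    parser = XMLParser(ns_clean=True, recover=True)
    tree = ET.parse('regex_trial.xml', parser)
    # iter() walks every element in document order
    for thing in tree.iter():
        if thing.tag == 'tc' and worthy(thing):
            dump(thing)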
    

    Yields...

    Good t Header 1
    Good t Header 2
    Good t Header 3
    Good t {% for i in items %}
    Good t {{ i.field1 }}
    Good t {{ i.field2 }}
    Good t {{ i.field3 }}
    Good t {% endfor %}
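
    If the file declares the w: prefix properly, a namespace-aware lookup may be more robust than comparing bare tag names. The following is only a rough sketch under several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages, it assumes the w: prefix is bound to the usual WordprocessingML URI, and SAMPLE, TEMPLATE_TAG and template_tags are hypothetical names invented for illustration (SAMPLE stands in for regex_trial.xml, whose real contents are not shown here). The regex is applied only to text the parser has already extracted, never to the markup itself.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the "w" prefix maps to the usual WordprocessingML namespace.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Hypothetical sample standing in for regex_trial.xml.
SAMPLE = """\
<w:tbl xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tr>
    <w:tc><w:p><w:r><w:t>Header 1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>{% for i in items %}</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>{{ i.field1 }}</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>{% endfor %}</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t></w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""

# Matches {% ... %} and {{ ... }} inside plain text nodes.
TEMPLATE_TAG = re.compile(r"\{%.*?%\}|\{\{.*?\}\}")

def template_tags(xml_text):
    """Collect template tags from the text of <w:t> elements inside <w:tc> cells."""
    root = ET.fromstring(xml_text)
    found = []
    for cell in root.iterfind(".//w:tc", NS):
        for t in cell.iterfind(".//w:t", NS):
            if t.text:  # skips both missing and empty text
                found.extend(TEMPLATE_TAG.findall(t.text))
    return found

print(template_tags(SAMPLE))
# ['{% for i in items %}', '{{ i.field1 }}', '{% endfor %}']
```

    The same prefix-to-URI mapping can also be passed to lxml's find and findall methods through their namespaces argument, so the approach should carry over if you stay with lxml.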
    