Regular Expressions to parse template tags in XML


I need to parse some XML to pull out embedded template tags for further parsing. I can't seem to bend Python's regular expressions to do what I want, though.

In Engli


2 Answers

悲&欢浪女

2021-01-22 05:23

Never ever parse HTML or XML or SGML with regular expressions.

Always use tools like lxml, libxml2 or BeautifulSoup - they will always do a smarter and better job than your own code.


无人及你

2021-01-22 05:31

Please don't use regular expressions for this problem.

I'm serious: parsing XML with a regex is hard, and it makes your code 50x less maintainable by anyone else.
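To make the point concrete, here is a minimal sketch of how a naive pattern can break on a harmless change in serialization while a parser is unaffected. The element names, the `NAIVE` pattern, and both sample strings are made up for illustration, and it uses the standard library's `xml.etree.ElementTree` only so that it runs without extra packages.

```python
import re
import xml.etree.ElementTree as ET

# Two hypothetical serializations of the same content; the second one only
# adds an attribute, which does not change the element's text.
PLAIN = '<r><t>{{ i.field1 }}</t></r>'
WITH_ATTR = '<r><t xml:space="preserve">{{ i.field1 }}</t></r>'

# A naive pattern that assumes the opening tag never carries attributes.
NAIVE = re.compile(r"<t>(.*?)</t>")

def regex_texts(xml_text):
    """Extract text with the naive regex."""
    return NAIVE.findall(xml_text)

def parser_texts(xml_text):
    """Extract text of every <t> element with a real XML parser."""
    return [t.text for t in ET.fromstring(xml_text).iter("t")]

print(regex_texts(PLAIN), regex_texts(WITH_ATTR))
print(parser_texts(PLAIN), parser_texts(WITH_ATTR))
```

The regex finds the tag in the first string and silently returns nothing for the second; the parser returns the same text for both. The pattern could be patched for attributes, but the same problem comes back with comments, CDATA sections, entities and namespace prefixes.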

lxml is the de facto tool that Pythonistas use to parse XML. Take a look at this article on Stack Overflow for sample usage, or consider this answer, which should have been the accepted one.

I hacked this up as a quick demo. It searches for <w:tc> elements with non-empty <w:t> children and prints "Good" next to each child element.

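import lxml.etree as ET

def worthy(elem):
    """Return True if elem has a direct <t> child with text."""
    # Bare tag names match the sample file here; in a document that declares
    # its namespaces, lxml reports tags as '{namespace-uri}t'.
    for child in elem.iterchildren():
        if child.tag == 't' and child.text is not None:
            return True
    return False

def dump(elem):
    """Print the tag and text of each direct child of elem."""
    for child in elem.iterchildren():
        print("Good", child.tag, child.text)

# recover=True lets lxml keep going past malformed markup
parser = ET.XMLParser(ns_clean=True, recover=True)
tree = ET.parse('regex_trial.xml', parser)
for thing in tree.iter():  # every node, in document order
    if thing.tag == 'tc' and worthy(thing):
        dump(thing)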

Yields...

Good t Header 1
Good t Header 2
Good t Header 3
Good t {% for i in items %}
Good t {{ i.field1 }}
Good t {{ i.field2 }}
Good t {{ i.field3 }}
Good t {% endfor %}
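The demo compares bare tag names, which only works when the parser reports them without a namespace. If the file declares the `w` prefix properly, a namespace-aware lookup is needed. The following is a rough sketch of that variant, not the code used above: the sample XML, the `cell_texts` helper, and the choice of the WordprocessingML namespace URI are all assumptions, and it uses the standard library's `xml.etree.ElementTree` (lxml accepts the same `iter` and `iterfind` calls) so it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the w: prefix (WordprocessingML main namespace).
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Hypothetical sample, loosely modelled on the table cells in the demo output.
SAMPLE = """\
<w:tbl xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:tr>
    <w:tc><w:p><w:r><w:t>Header 1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>{% for i in items %}</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>{{ i.field1 }}</w:t></w:r></w:p></w:tc>
    <w:tc><w:p/></w:tc>
  </w:tr>
</w:tbl>
"""

def cell_texts(xml_text):
    """Return the text of every non-empty w:t found inside a w:tc cell."""
    root = ET.fromstring(xml_text)
    texts = []
    # Tags are stored as '{namespace-uri}localname' once a namespace is declared.
    for cell in root.iter("{%s}tc" % NS["w"]):
        # The prefix map lets the path use the familiar w: prefix.
        for t in cell.iterfind(".//w:t", NS):
            if t.text:
                texts.append(t.text)
    return texts

for text in cell_texts(SAMPLE):
    print("Good", text)
```

Unlike the demo, this searches for `w:t` at any depth below the cell rather than only among direct children, and it skips the empty cell. Once the template tags are isolated as plain strings like this, running a regex over the strings themselves is a reasonable next step; the objection is only to running it over the markup.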
