I need to parse some XML to pull out embedded template tags for further parsing. I can't seem to bend Python's regular expressions to do what I want, though.
In Engli
Never ever parse HTML or XML or SGML with regular expressions.
Always use tools like lxml, libxml2 or Beautiful Soup; they will always do a smarter and better job than your code.
Please don't use regular expressions for this problem.
I'm serious, parsing XML with a regex is hard, and it makes your code 50x less maintainable by anyone else.
lxml is the de facto tool that Pythonistas use to parse XML... take a look at this article on Stack Overflow for sample usage. Or consider this answer, which should have been the accepted answer.
I hacked this up as a quick demo... it searches for <w:tc> with non-empty <w:t> children and prints Good next to each element.
import lxml.etree as ET
from lxml.etree import XMLParser

def worthy(elem):
    # True if elem has at least one <t> child that carries text
    for child in elem.iterchildren():
        if (child.tag == 't') and (child.text is not None):
            return True
    return False

def dump(elem):
    # Print the tag and text of every child of elem
    for child in elem.iterchildren():
        print("Good", child.tag, child.text)

# recover=True lets the parser continue past errors in the input
parser = XMLParser(ns_clean=True, recover=True)
tree = ET.parse('regex_trial.xml', parser)
# Walk every element in the document
for thing in tree.iter():
    if thing.tag == 'tc' and worthy(thing):
        dump(thing)
Yields...
Good t Header 1
Good t Header 2
Good t Header 3
Good t {% for i in items %}
Good t {{ i.field1 }}
Good t {{ i.field2 }}
Good t {{ i.field3 }}
Good t {% endfor %}
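If the document declares the w: prefix properly, the tags come back namespace-qualified and a bare 't' comparison will not match. Here is a rough sketch of the same search using the standard library's xml.etree.ElementTree with an explicit namespace map. The sample document, the namespace URI (http://example.com/w) and the helper name cell_texts are all made up for illustration; substitute whatever your real file declares.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder, not a real schema.
SAMPLE = """<w:tbl xmlns:w="http://example.com/w">
  <w:tr>
    <w:tc><w:t>Header 1</w:t></w:tc>
    <w:tc><w:t>{{ i.field1 }}</w:t></w:tc>
    <w:tc><w:t/></w:tc>
  </w:tr>
</w:tbl>"""

# Prefix-to-URI map used when resolving "w:..." in search paths (assumed URI).
NS = {"w": "http://example.com/w"}

def cell_texts(xml_text):
    """Return the text of every non-empty <w:t> that is a direct child of a <w:tc>."""
    root = ET.fromstring(xml_text)
    texts = []
    # Elements are stored under their qualified name, "{uri}tc".
    for cell in root.iter("{%s}tc" % NS["w"]):
        for t in cell.findall("w:t", NS):
            if t.text:  # skips empty <w:t/> elements
                texts.append(t.text)
    return texts

for text in cell_texts(SAMPLE):
    print("Good", text)
```

The template tags come out as plain strings, so whatever parses them next never has to look at the XML.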