I seem to remember that Regular Expressions in DotNet have a special mechanism that allows for the correct matching of nested structures, like the grouping in "( (a ( ( c ) b ) ) ( d ) e )
".
What is the python equivalent of this feature? Can this be achieved using regular expressions with some workaround? (Though it seems to be the sort of problem that current implementations of regex aren't designed for)
You can't do this generally using Python regular expressions. (.NET regular expressions have been extended with "balancing groups" which is what allows nested matches.)
However, PyParsing is a very nice package for this type of thing:
from pyparsing import nestedExpr
data = "( (a ( ( c ) b ) ) ( d ) e )"
print nestedExpr().parseString(data).asList()
The output is:
[[['a', [['c'], 'b']], ['d'], 'e']]
More on PyParsing:
Regular expressions cannot parse nested structures. Nested structures are not regular, by definition. They cannot be constructed by a regular grammar, and they cannot be parsed by a finite state automaton (a regular expression can be seen as a shorthand notation for an FSA).
Today's "regex" engines sometimes support some limited "nesting" constructs, but from a technical standpoint, they shouldn't be called "regular" anymore.
Edit: falsetru's nested parser, which I've slightly modified to accept arbitrary regex patterns to specify delimiters and item separators, is faster and simpler than my original re.Scanner
solution:
import re
def parse_nested(text, left=r'[(]', right=r'[)]', sep=r','):
""" https://stackoverflow.com/a/17141899/190597 (falsetru) """
pat = r'({}|{}|{})'.format(left, right, sep)
tokens = re.split(pat, text)
stack = [[]]
for x in tokens:
if not x or re.match(sep, x):
continue
if re.match(left, x):
# Nest a new list inside the current list
current = []
stack[-1].append(current)
stack.append(current)
elif re.match(right, x):
stack.pop()
if not stack:
raise ValueError('error: opening bracket is missing')
else:
stack[-1].append(x)
if len(stack) > 1:
print(stack)
raise ValueError('error: closing bracket is missing')
return stack.pop()
text = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"
print(parse_nested(text, r'\s*{{', r'}}\s*'))
yields
['a', ['c1::group', ['c2::containing::HINT'], 'a few'], ['c3::words'], 'or three']
Nested structures can not be matched with Python regex alone, but it is remarkably easy to build a basic parser (which can handle nested structures) using re.Scanner:
import re
class Node(list):
def __init__(self, parent=None):
self.parent = parent
class NestedParser(object):
def __init__(self, left='\(', right='\)'):
self.scanner = re.Scanner([
(left, self.left),
(right, self.right),
(r"\s+", None),
(".+?(?=(%s|%s|$))" % (right, left), self.other),
])
self.result = Node()
self.current = self.result
def parse(self, content):
self.scanner.scan(content)
return self.result
def left(self, scanner, token):
new = Node(self.current)
self.current.append(new)
self.current = new
def right(self, scanner, token):
self.current = self.current.parent
def other(self, scanner, token):
self.current.append(token.strip())
It can be used like this:
p = NestedParser()
print(p.parse("((a+b)*(c-d))"))
# [[['a+b'], '*', ['c-d']]]
p = NestedParser()
print(p.parse("( (a ( ( c ) b ) ) ( d ) e )"))
# [[['a', [['c'], 'b']], ['d'], 'e']]
By default NestedParser
matches nested parentheses. You can pass other regex to match other nested patterns, such as brackets, []
. For example,
p = NestedParser('\[', '\]')
result = (p.parse("Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"))
# ['Lorem ipsum dolor sit amet', ['@a xxx yyy', ['@b xxx yyy', ['@c xxx yyy']]],
# 'lorem ipsum sit amet']
p = NestedParser('<foo>', '</foo>')
print(p.parse("<foo>BAR<foo>BAZ</foo></foo>"))
# [['BAR', ['BAZ']]]
Of course, pyparsing
can do a whole lot more than the above code can. But for this single purpose, the above NestedParser
is about 5x faster for small strings:
In [27]: import pyparsing as pp
In [28]: data = "( (a ( ( c ) b ) ) ( d ) e )"
In [32]: %timeit pp.nestedExpr().parseString(data).asList()
1000 loops, best of 3: 1.09 ms per loop
In [33]: %timeit NestedParser().parse(data)
1000 loops, best of 3: 234 us per loop
and around 28x faster for larger strings:
In [44]: %timeit pp.nestedExpr().parseString('({})'.format(data*10000)).asList()
1 loops, best of 3: 8.27 s per loop
In [45]: %timeit NestedParser().parse('({})'.format(data*10000))
1 loops, best of 3: 297 ms per loop
来源:https://stackoverflow.com/questions/1099178/matching-nested-structures-with-regular-expressions-in-python