I am trying to write a regexp which matches nested parentheses, e.g.:
\"(((text(text))))(text()()text)(casual(characters(#$%^^&^%#@!&**^*
The following code uses my Parser class (licensed under CC-BY 3.0). It works on UTF-8 input thanks to my UTF8 class.
It works by using a recursive function to iterate over the string: the function calls itself each time it finds a (. It also detects mismatched pairs, which show up when it reaches the end of the string without finding the corresponding ).
This code also takes a $callback parameter that you can use to process each piece it finds. The callback receives two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns replaces that piece in the contents of the string (these changes are visible to the callbacks at higher levels).
Note: the code does not include type checks.
Non-recursive part:
function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
{
    //Create a new parser object
    $parser = new Parser($string);
    //Call the recursive part
    $result = ParseParenthesisFragment($parser, $callback);
    if ($result['close'])
    {
        return $result['contents'];
    }
    else
    {
        //UNEXPECTED END OF STRING
        // throw new Exception('UNEXPECTED END OF STRING');
        return false;
    }
}
Recursive part:
function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
{
    $contents = '';
    $level = 0;
    while(true)
    {
        $parenthesis = array('(', ')');
        // Jump to the first/next "(" or ")"
        $new = $parser->ConsumeUntil($parenthesis);
        $parser->Flush(); //<- Flush is just an optimization
        // Append what we got so far
        $contents .= $new;
        // Read the "(" or ")"
        $element = $parser->Consume($parenthesis);
        if ($element === '(') //If we found "("
        {
            //OPEN
            $result = ParseParenthesisFragment($parser, $callback);
            if ($result['close'])
            {
                // It was closed, all ok
                // Update the level of this iteration
                $newLevel = $result['level'] + 1;
                if ($newLevel > $level)
                {
                    $level = $newLevel;
                }
                // Call the callback
                $new = call_user_func
                (
                    $callback,
                    $result['contents'],
                    $level
                );
                // Append what we got
                $contents .= $new;
            }
            else
            {
                //UNEXPECTED END OF STRING
                // Don't call the callback for mismatched parenthesis
                // just append and return
                return array
                (
                    'close' => false,
                    'contents' => $contents.$result['contents']
                );
            }
        }
        else if ($element === ')') //If we found a ")"
        {
            //CLOSE
            return array
            (
                'close' => true,
                'contents' => $contents,
                'level' => $level
            );
        }
        else
        {
            //END OF STRING
            // Consume() found neither "(" nor ")", so nothing is left to read.
            // (The previous check read $result['status'], a key that is never set.)
            return array
            (
                'close' => false,
                'contents' => $contents
            );
        }
    }
}
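Since the Parser and UTF8 classes are not shown here, the following is a minimal self-contained sketch of the same recursive idea, not the code above. It scans bytes with strcspn() (safe for UTF-8, because "(" and ")" are single ASCII bytes). The names transformParentheses and parseFragment are hypothetical, and two behaviours are my assumptions: the callback receives the level of the piece itself (0 = a piece with no nested parentheses), and reaching the end of the input at the top level counts as success.

```php
<?php
// Hypothetical stand-in for the Parser-based functions; names are illustrative.
function parseFragment(string $s, int &$pos, callable $callback, bool $nested): array
{
    $contents = '';
    $level = 0;
    $length = strlen($s);
    while ($pos < $length) {
        // Copy everything up to the next "(" or ")".
        $run = strcspn($s, '()', $pos);
        $contents .= substr($s, $pos, $run);
        $pos += $run;
        if ($pos >= $length) {
            break; // no parenthesis left
        }
        $char = $s[$pos++];
        if ($char === '(') {
            // OPEN: parse the nested piece recursively.
            $inner = parseFragment($s, $pos, $callback, true);
            if (!$inner['closed']) {
                return ['closed' => false, 'contents' => '', 'level' => 0];
            }
            $level = max($level, $inner['level'] + 1);
            // Assumption: the callback gets the level of the piece it processes.
            $contents .= $callback($inner['contents'], $inner['level']);
        } elseif ($nested) {
            // CLOSE: this piece is complete.
            return ['closed' => true, 'contents' => $contents, 'level' => $level];
        } else {
            // A ")" at the top level has no matching "(".
            return ['closed' => false, 'contents' => '', 'level' => 0];
        }
    }
    // End of input: fine at the top level, an error inside an open "(".
    return ['closed' => !$nested, 'contents' => $contents, 'level' => $level];
}

function transformParentheses(string $s, callable $callback)
{
    $pos = 0;
    $result = parseFragment($s, $pos, $callback, false);
    return $result['closed'] ? $result['contents'] : false;
}

// Usage: tag every piece with its level.
$tag = function ($text, $level) {
    return '[' . $level . ':' . $text . ']';
};
$tagged = transformParentheses('a(b(c)d)e', $tag); // 'a[1:b[0:c]d]e'
$broken = transformParentheses('a(b', $tag);       // false (missing ")")
```

The inner piece is rewritten first, so the level-1 callback already sees "b[0:c]d", which matches the "changes are visible at higher levels" behaviour described above.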
This pattern works:
$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
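As a quick check of what this pattern returns, here is a small sketch; the subject string is just a sample I made up for illustration.

```php
<?php
$pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
$subject = 'a(b(c))d(e)'; // sample input, assumed for illustration

preg_match_all($pattern, $subject, $matches);
print_r($matches[0]); // the two outermost groups: "(b(c))" and "(e)"
```

Only the outermost balanced groups are returned, because each match consumes its nested groups.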
The content inside the parentheses is described simply as:
"all that is not parenthesis OR recursion (= other parenthesis)" x 0 or more times
If you want to catch all substrings inside parentheses, including the nested ones, you must put this pattern inside a lookahead to obtain all overlapping results:
$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
preg_match_all($pattern, $subject, $matches);
print_r($matches[1]);
Note that I have added a capturing group and replaced (?R) with (?1):
(?R) -> refers to the whole pattern (You can write (?0) too)
(?1) -> refers to the first capturing group
What is this lookahead trick?
A subpattern inside a lookahead (or a lookbehind) doesn't consume any characters; it's only an assertion (a test). Thus, the same substring can be tested several times.
If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.
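To make the lookahead behaviour concrete, here is a small sketch on a sample subject of my own choosing. It shows that the whole-pattern matches are empty while group 1 holds every balanced group, nested ones included.

```php
<?php
$pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
$subject = 'a(b(c))d(e)'; // sample input, assumed for illustration

$count = preg_match_all($pattern, $subject, $matches);

print_r($matches[0]); // three empty strings: the lookahead consumes nothing
print_r($matches[1]); // "(b(c))", "(c)", "(e)"
```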
Note: the recursive subpattern can be improved like this:
\( [^()]*+ (?: (?R) [^()]* )*+ \)
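A small sanity check, on a sample string I picked for illustration, that the improved subpattern finds the same matches as the original one:

```php
<?php
$original = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
$improved = '~ \( [^()]*+ (?: (?R) [^()]* )*+ \) ~x';
$subject  = 'f(x, g(y, h(z)), k) + m(n)'; // sample input, assumed

preg_match_all($original, $subject, $a);
preg_match_all($improved, $subject, $b);

var_dump($a[0] === $b[0]); // bool(true)
print_r($b[0]);            // "(x, g(y, h(z)), k)" and "(n)"
```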
When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters, which were { and }. So my approach was to make it more generic.
$delimiter_wrap = '~';
$delimiter_left = '{';/* put YOUR left delimiter here. */
$delimiter_right = '}';/* put YOUR right delimiter here. */
$delimiter_left = preg_quote( $delimiter_left, $delimiter_wrap );
$delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
$pattern = $delimiter_wrap . $delimiter_left
. '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
. $delimiter_right . $delimiter_wrap;
/* Now you can use the generated pattern. */
preg_match_all( $pattern, $subject, $matches );
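For reuse, the snippet above could be wrapped in a small function. This is only a sketch: buildNestedPattern is a hypothetical name, and the sample subjects are made up. Because the delimiters end up inside a character class, I would only expect this to work with single-character delimiters.

```php
<?php
// Hypothetical helper wrapping the snippet above; the name is illustrative.
function buildNestedPattern(string $left, string $right, string $wrap = '~'): string
{
    // Escape the delimiters so they are taken literally in the regex.
    $left  = preg_quote($left, $wrap);
    $right = preg_quote($right, $wrap);

    return $wrap . $left
        . '((?:[^' . $left . $right . ']++|(?R))*)'
        . $right . $wrap;
}

$pattern = buildNestedPattern('{', '}');
preg_match_all($pattern, 'x{a{b}c}y{d}', $matches);

print_r($matches[0]); // "{a{b}c}" and "{d}" (with delimiters)
print_r($matches[1]); // "a{b}c" and "d"     (contents only)
```

The same helper can be called with other pairs, for example buildNestedPattern('[', ']').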