How to write a recursive regex that matches nested parentheses?

别那么骄傲 2020-11-28 12:29

I am trying to write a regexp which matches nested parentheses, e.g.:

\"(((text(text))))(text()()text)(casual(characters(#$%^^&&#^%#@!&**&#^*         


        
3 Answers
  • 2020-11-28 12:55

    The following code uses my Parser class (licensed under CC-BY 3.0) and works on UTF-8 strings thanks to my UTF8 class.

    It works by using a recursive function to iterate over the string. The function calls itself each time it finds a (. It also detects mismatched pairs when it reaches the end of the string without finding the corresponding ).

    The code also takes a $callback parameter that you can use to process each piece it finds. The callback receives two parameters: 1) the string, and 2) the level (0 = deepest). Whatever the callback returns replaces that piece in the contents of the string (these changes are visible to the callbacks at higher levels).

    Note: the code does not include type checks.

    Non-recursive part:

    function ParseParenthesis(/*string*/ $string, /*function*/ $callback)
    {
        //Create a new parser object
        $parser = new Parser($string);
        //Call the recursive part
        $result = ParseParenthesisFragment($parser, $callback);
        if ($result['close'])
        {
            return $result['contents'];
        }
        else
        {
            //UNEXPECTED END OF STRING
            // throw new Exception('UNEXPECTED END OF STRING');
            return false;
        }
    }
    

    Recursive part:

    function ParseParenthesisFragment(/*parser*/ $parser, /*function*/ $callback)
    {
        $contents = '';
        $level = 0;
        while(true)
        {
            $parenthesis = array('(', ')');
            // Jump to the first/next "(" or ")"
            $new = $parser->ConsumeUntil($parenthesis);
            $parser->Flush(); //<- Flush is just an optimization
            // Append what we got so far
            $contents .= $new;
            // Read the "(" or ")"
            $element = $parser->Consume($parenthesis);
            if ($element === '(') //If we found "("
            {
                //OPEN
                $result = ParseParenthesisFragment($parser, $callback);
                if ($result['close'])
                {
                    // It was closed, all ok
                    // Update the level of this iteration
                    $newLevel = $result['level'] + 1;
                    if ($newLevel > $level)
                    {
                        $level = $newLevel;
                    }
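                    // Call the callback with the level of the piece that was
                    // just closed (0 = deepest), not the level of this fragment
                    $new = call_user_func
                    (
                        $callback,
                        $result['contents'],
                        $result['level']
                    );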
                    // Append what we got
                    $contents .= $new;
                }
                else
                {
                    //UNEXPECTED END OF STRING
                    // Don't call the callback for mismatched parentheses
                    // just append and return
                    return array
                    (
                        'close' => false,
                        'contents' => $contents.$result['contents']
                    );
                }
            }
            else if ($element === ')') //If we found a ")"
            {
                //CLOSE
                return array
                (
                    'close' => true,
                    'contents' => $contents,
                    'level' => $level
                );
            }
            else //Neither "(" nor ")" could be read
            {
                //END OF STRING
                // ($result must not be tested here: it is unset unless a "(" was found)
                return array
                (
                    'close' => false,
                    'contents' => $contents
                );
            }
        }
    }
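
    Since the Parser class itself is not shown here, the following is only a minimal self-contained sketch of the same recursive idea using plain string indexing. The names parseFragment and parseParentheses are hypothetical and not part of my classes. The sketch also makes its own choices: at the top level the end of the string counts as success and a stray ) as an error, and the callback receives the nesting height of the piece (0 = no parentheses inside it). Working byte by byte is safe for UTF-8 because ( and ) are single ASCII bytes.

    ```php
    // Hypothetical helper: walks $s starting at $pos.
    // Returns [text, level, ok]; $depth is 0 at the top level.
    function parseFragment(string $s, int &$pos, callable $callback, int $depth = 0): array
    {
        $text = '';
        $level = 0; // height of the deepest group nested inside this fragment
        $length = strlen($s);
        while ($pos < $length) {
            $char = $s[$pos++];
            if ($char === '(') {
                // OPEN: parse the nested group recursively
                [$inner, $innerLevel, $ok] = parseFragment($s, $pos, $callback, $depth + 1);
                if (!$ok) {
                    // Mismatched pair: skip the callback and report failure
                    return [$text . $inner, $level, false];
                }
                $level = max($level, $innerLevel + 1);
                // The callback's return value replaces the nested group
                $text .= $callback($inner, $innerLevel);
            } elseif ($char === ')') {
                // CLOSE: valid inside a group, a stray ")" at depth 0
                return [$text, $level, $depth > 0];
            } else {
                $text .= $char;
            }
        }
        // END OF STRING: fine at the top level, an unclosed "(" otherwise
        return [$text, $level, $depth === 0];
    }

    function parseParentheses(string $s, callable $callback)
    {
        $pos = 0;
        [$text, , $ok] = parseFragment($s, $pos, $callback);
        return $ok ? $text : false;
    }

    // Sample callback: tag every group with its level
    $wrap = function (string $piece, int $level): string {
        return '[' . $level . ':' . $piece . ']';
    };

    var_dump(parseParentheses('a(b(c)d)e(f)', $wrap)); // 'a[1:b[0:c]d]e[0:f]'
    var_dump(parseParentheses('a(b', $wrap));          // false
    ```

    The inner groups are processed first, so the callback for (b(c)d) already sees the replaced text of (c).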
    
  • 2020-11-28 13:00

    This pattern works:

    $pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
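
    As a quick check, here is a minimal sketch that applies the pattern to a sample string. The $subject value is made up for illustration; it contains one nested group, one simple group, and an unbalanced tail that should not match.

    ```php
    $pattern = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
    $subject = 'a (b (c) d) e (f) g )( h';

    // Only the outermost balanced groups are returned, because a match
    // consumes its nested groups along with it.
    preg_match_all($pattern, $subject, $matches);
    print_r($matches[0]); // '(b (c) d)' and '(f)'
    ```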
    

    The content inside the parentheses is simply described as:

    "anything that is not a parenthesis OR a recursion (= another pair of parentheses)", repeated 0 or more times

    If you want to capture all substrings enclosed in parentheses, including the nested ones, you must put this pattern inside a lookahead to obtain all overlapping results:

    $pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
    preg_match_all($pattern, $subject, $matches);
    print_r($matches[1]);
    

    Note that I have added a capturing group and replaced (?R) with (?1):

    (?R) -> refers to the whole pattern (You can write (?0) too)
    (?1) -> refers to the first capturing group
    

    What is this lookahead trick?

    A subpattern inside a lookahead (or a lookbehind) doesn't consume any characters; it is only an assertion (a test). This makes it possible to test the same substring several times.

    If you display the whole pattern results (print_r($matches[0]);), you will see that all results are empty strings. The only way to obtain the substrings found by the subpattern inside the lookahead is to enclose the subpattern in a capturing group.
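
    To make this concrete, here is a small sketch with a made-up $subject. It shows that the nested group (b) is reported in addition to the group that contains it, and that the whole-pattern matches are empty.

    ```php
    $pattern = '~(?= ( \( (?: [^()]+ | (?1) )*+ \) ) )~x';
    $subject = '(a(b)c)(d)';

    preg_match_all($pattern, $subject, $matches);

    print_r($matches[1]); // '(a(b)c)', '(b)' and '(d)'
    print_r($matches[0]); // three empty strings
    ```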

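    Note: the recursive subpattern can be made more efficient by unrolling the alternation, so that runs of non-parenthesis characters are consumed outside the repeated group (still written for the x modifier):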

    \( [^()]*+ (?: (?R) [^()]* )*+ \)
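
    A minimal sketch to check that this variant finds the same matches as the first pattern. The variable names and the sample subjects are made up for illustration, and this only compares results; it does not measure speed.

    ```php
    $simple   = '~ \( (?: [^()]+ | (?R) )*+ \) ~x';
    $unrolled = '~ \( [^()]*+ (?: (?R) [^()]* )*+ \) ~x';

    $subjects = ['(a(b)c)', '((()))', 'x (y) z (', '()()'];
    $same = [];
    foreach ($subjects as $subject) {
        preg_match_all($simple, $subject, $a);
        preg_match_all($unrolled, $subject, $b);
        // Compare the full matches found by both patterns
        $same[] = ($a[0] === $b[0]);
    }
    var_dump($same); // true for every sample subject
    ```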
    
  • 2020-11-28 13:08

    When I found this answer I wasn't able to figure out how to modify the pattern to work with my own delimiters, which were { and }. So my approach was to make it more generic.

    Here is a script that generates the regex pattern for your own left and right delimiters.

    $delimiter_wrap  = '~';
    $delimiter_left  = '{';/* put YOUR left delimiter here.  */
    $delimiter_right = '}';/* put YOUR right delimiter here. */
    
    $delimiter_left  = preg_quote( $delimiter_left,  $delimiter_wrap );
    $delimiter_right = preg_quote( $delimiter_right, $delimiter_wrap );
    $pattern         = $delimiter_wrap . $delimiter_left
                     . '((?:[^' . $delimiter_left . $delimiter_right . ']++|(?R))*)'
                     . $delimiter_right . $delimiter_wrap;
    
    /* Now you can use the generated pattern. */
    preg_match_all( $pattern, $subject, $matches );
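
    For reuse, the same steps can be wrapped in a function. This is only a sketch: buildNestedPattern is a hypothetical name and the sample $subject is made up. Because the delimiters are placed inside a character class, it assumes each delimiter is a single character.

    ```php
    // Hypothetical helper: builds the recursive pattern for one-character delimiters.
    function buildNestedPattern(string $left, string $right, string $wrap = '~'): string
    {
        $l = preg_quote($left, $wrap);
        $r = preg_quote($right, $wrap);
        return $wrap . $l . '((?:[^' . $l . $r . ']++|(?R))*)' . $r . $wrap;
    }

    $pattern = buildNestedPattern('{', '}');
    $subject = 'a {b {c} d} e {f}';

    preg_match_all($pattern, $subject, $matches);
    print_r($matches[0]); // '{b {c} d}' and '{f}'
    print_r($matches[1]); // 'b {c} d' and 'f' (contents without the outer delimiters)
    ```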
    