Recursive PHP Regex

醉梦人生 2020-12-05 07:00

EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the s

4 Answers
  • 2020-12-05 07:50

    Excellent (and difficult) question!

    First, with the PCRE regex engine, (?R) behaves like an atomic group (unlike Perl?). Once the recursive call matches (or fails to match), whatever happened inside it is final: all backtracking breadcrumbs saved within the call are discarded. However, the regex engine does remember what the whole (?R) expression matched, and it can give that text back and try the other alternative to achieve an overall match.

    To describe what is happening, let's change your example slightly so that it is easier to keep track of what is being matched at each step. Instead of aaaa as the subject text, let's use abcd. And let's change the regex from '#a(?:(?R)|a?)a#' to '#.(?:(?R)|.?).#'. The matching behavior of the regex engine is the same.

    Matching regex: /.(?:(?R)|.?)./ to: "abcd"

    Step Depth Regex          Subject  Comment
    1    0     .(?:(?R)|.?).  abcd     Dot matches "a". Advance pointers.
               ^              ^
    2    0     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 1).
                     ^         ^
    3    1     .(?:(?R)|.?).  abcd     Dot matches "b". Advance pointers.
               ^               ^
    4    1     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 2).
                     ^          ^
    5    2     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
               ^                ^
    6    2     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 3).
                     ^           ^
    7    3     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
               ^                 ^
    8    3     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 4).
                     ^            ^
    9    4     .(?:(?R)|.?).  abcd     Dot fails to match end of string.
               ^                  ^    DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
                                       Give back text consumed by depth 4 (?R) = ""
    10   3     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches nothing at EOS.
                        ^         ^    Advance regex pointer.
    11   3     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    DEPTH 3 (?R) FAILS. Return to step 6 depth 2
                                       Give back text consumed by depth 3 (?R) = "d"
    12   2     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "d".
                        ^        ^     Advance pointers.
    13   2     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to step 12 depth 2
    14   2     .(?:(?R)|.?).  abcd     Match zero "d" (give it back).
                        ^        ^     Advance regex pointer.
    15   2     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
                           ^     ^     DEPTH 2 (?R) SUCCEEDS.
                                       Return to step 4 depth 1
    16   1     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to try other alternative. Give back
                                        text consumed by depth 2 (?R) = "cd"
    17   1     .(?:(?R)|.?).  abcd     Optional dot matches "c". Advance pointers.
                        ^       ^      
    18   1     .(?:(?R)|.?).  abcd     Required dot matches "d". Advance pointers.
                           ^     ^     DEPTH 1 (?R) SUCCEEDS.
                                       Return to step 2 depth 0
    19   0     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to try other alternative. Give back
                                        text consumed by depth 1 (?R) = "bcd"
    20   0     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "b".
                        ^      ^       Advance pointers.
    21   0     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
                           ^    ^      SUCCESSFUL MATCH of "abc"
    

    There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question). A similar, albeit much longer, sequence of steps can be traced for the other, longer string in the question.
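The trace above can be reproduced with a small simulation. This is a minimal sketch in Python, not PCRE itself: the function name match_end is hypothetical, and the only assumption it encodes is the one stated in this answer, namely that (?R) is atomic, so only the first result of a recursive call is ever used.

```python
from typing import Optional


def match_end(subject: str, start: int = 0) -> Optional[int]:
    """Simulate .(?:(?R)|.?). at `start`, treating (?R) as atomic.

    Returns the end offset of the match, or None if there is none.
    """
    n = len(subject)
    if start >= n:                       # leading dot needs a character
        return None
    pos = start + 1

    # 1st alternative: recurse. Only the first result is kept (atomic),
    # so the recursive call is never re-entered to try a shorter match.
    inner = match_end(subject, pos)
    if inner is not None and inner < n:  # trailing dot needs a character
        return inner + 1

    # 2nd alternative: optional dot, greedy first, then empty.
    if pos + 1 < n:                      # optional dot + trailing dot
        return pos + 2
    if pos < n:                          # empty optional dot + trailing dot
        return pos + 1
    return None


end = match_end("abcd")
print("abcd"[:end])   # abc
```

Because the dot accepts any character, the same function gives aaa for the subject aaaa from the original question.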

  • 2020-12-05 07:50

    IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.

    Note: This is explained in the order in which you can conceptualize it. The regex engine works in the opposite order: it dives down to the base case and works its way back.

    Since your outer a's are explicitly there, it will match an a between two a's, or a previous recursion's match of the entire pattern between two a's. As a result, it will only match odd numbers of a's (the middle one plus multiples of two).

    At a length of three, aaa is the current recursion's matching pattern, so on the fourth recursion it's looking for an a between two a's (i.e., aaa) or the previous recursion's matched pattern between two a's (i.e., a+aaa+a). Obviously it can't match five a's when the string isn't that long, so the longest match it can make is three.

    Similar deal with a length of six, as it can only match the "default" aaa or the previous recursion's match surrounded by a's (i.e., a+aaaaa+a).


    However, it does not match all odd lengths.

    Since you're matching recursively, you can only match the literal aaa or a+(previous recursion's match)+a. Each successive match will therefore always be two a's longer than the previous match, or it will punt and fall back to aaa.

    At a length of seven (matching against aaaaaaa), the previous recursion's match was the fallback aaa. So this time, even though there are seven a's, it will only match three (aaa) or five (a+aaa+a).


    When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):

    no match
    aa
    aaa
    aaa
    aaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    aaa
    aaaaa
    aaaaaaa
    aaaaaaaaa
    aaaaaaaaaaa
    aaaaaaaaaaaaa
    aaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaa
    aaaaaaaaaaaaaaaaaaa
    

    What's going on here? Well, I'll tell you! :-)

    When a recursive match would be one character longer than the input string, it punts back to aaa, as we've seen. In every iteration after that, the pattern starts over, matching two more characters than the previous match. With every iteration, the length of the input increases by one but the length of the match increases by two. When the match length finally catches up with the length of the input string and would surpass it, it punts back to aaa. And so on.

    Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:

    (input len.)  -  (match len.)  =  (difference)
    
     1   -    0   =    1
     2   -    2   =    0
     3   -    3   =    0
     4   -    3   =    1
     5   -    5   =    0
     6   -    3   =    3
     7   -    5   =    2
     8   -    7   =    1
     9   -    9   =    0
    10   -    3   =    7
    11   -    5   =    6
    12   -    7   =    5
    13   -    9   =    4
    14   -   11   =    3
    15   -   13   =    2
    16   -   15   =    1
    17   -   17   =    0
    18   -    3   =   15
    19   -    5   =   14
    20   -    7   =   13
    21   -    9   =   12
    22   -   11   =   11
    23   -   13   =   10
    24   -   15   =    9
    25   -   17   =    8
    26   -   19   =    7
    27   -   21   =    6
    28   -   23   =    5
    29   -   25   =    4
    30   -   27   =    3
    31   -   29   =    2
    32   -   31   =    1
    33   -   33   =    0
    34   -    3   =   31
    35   -    5   =   30
    36   -    7   =   29
    37   -    9   =   28
    38   -   11   =   27
    39   -   13   =   26
    40   -   15   =   25
    41   -   17   =   24
    42   -   19   =   23
    43   -   21   =   22
    44   -   23   =   21
    45   -   25   =   20
    46   -   27   =   19
    47   -   29   =   18
    48   -   31   =   17
    49   -   33   =   16
    50   -   35   =   15
    51   -   37   =   14
    52   -   39   =   13
    53   -   41   =   12
    54   -   43   =   11
    55   -   45   =   10
    56   -   47   =    9
    57   -   49   =    8
    58   -   51   =    7
    59   -   53   =    6
    60   -   55   =    5
    61   -   57   =    4
    62   -   59   =    3
    63   -   61   =    2
    64   -   63   =    1
    65   -   65   =    0
    66   -    3   =   63
    67   -    5   =   62
    68   -    7   =   61
    69   -    9   =   60
    70   -   11   =   59
    71   -   13   =   58
    72   -   15   =   57
    73   -   17   =   56
    74   -   19   =   55
    75   -   21   =   54
    76   -   23   =   53
    77   -   25   =   52
    78   -   27   =   51
    79   -   29   =   50
    80   -   31   =   49
    

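    For reasons that should now make sense, the match covers the whole input (difference 0) when the input length is a power of 2 plus 1 (3, 5, 9, 17, 33, 65), and it falls back to aaa at the very next length (4, 6, 10, 18, 34, 66).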
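The table can be regenerated from the rule described above. The sketch below is in Python and is only a model of the observed pattern, not of the engine: match_length is a hypothetical helper that encodes "wrap the previous match in a...a if it still fits, otherwise fall back to aaa", with the length-2 case (aa) taken directly from the output listed earlier.

```python
def match_length(n: int) -> int:
    """Length matched against n copies of 'a' (0 means no match)."""
    if n < 2:
        return 0
    length = 2                      # "aa" is the match for n == 2
    for size in range(3, n + 1):
        # Wrap the previous match in a...a if it still fits, else fall back.
        length = length + 2 if length + 2 <= size else 3
    return length


# Input lengths (1..80) at which the match covers the whole input.
full = [n for n in range(1, 81) if match_length(n) == n]
print(full)   # [2, 3, 5, 9, 17, 33, 65]
```

Subtracting match_length(n) from n for n = 1 to 80 gives the difference column shown above.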


    Stepping through by hand

    I've slightly simplified the original pattern for this example. Remember this. We will come back to it.

    a((?R)|a)a
    

    What the author Jeffrey Friedl means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of (?R) as many times as possible.

    a((?R)|a)a                    # this
    
    a((a((?R)|a)a)|a)a            # becomes this
    
    a((a((a((?R)|a)a)|a)a)|a)a    # becomes this
    
    # and so on...
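
The substitution shown above is purely textual, so it can be sketched in a few lines of Python. The helper name expand is hypothetical, and this only illustrates how the pattern unfolds; it says nothing about how the engine backtracks.

```python
def expand(pattern: str, times: int) -> str:
    """Textually substitute the whole pattern for (?R), `times` times."""
    expanded = pattern
    for _ in range(times):
        expanded = expanded.replace("(?R)", "(" + pattern + ")")
    return expanded


print(expand("a((?R)|a)a", 1))   # a((a((?R)|a)a)|a)a
print(expand("a((?R)|a)a", 2))   # a((a((a((?R)|a)a)|a)a)|a)a
```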
    

    When tracing this by hand, you could work from the inside out. In (?R)|a, a is your base case. So we'll start with that.

    a(a)a
    

    If that matches the input string, take that match (aaa) back to the original expression and put it in place of (?R).

    a(aaa|a)a
    

    If the input string is matched with our recursive value, substitute that match (aaaaa) back into the original expression to recurse again.

    a(aaaaa|a)a
    

    Repeat until you can't match your input using the result of the previous recursion.

    Example
    Input: aaaaaa
    Regex: a((?R)|a)a

    Start at base case, aaa.
    Does the input match with this value? Yes: aaa
    Recurse by putting aaa in the original expression:

    a(aaa|a)a
    

    Does the input match with our recursive value? Yes: aaaaa
    Recurse by putting aaaaa in the original expression:

    a(aaaaa|a)a
    

    Does the input match with our recursive value? No: aaaaaaa

    Then we stop here. The above expression could be rewritten (for simplicity) as:

    aaaaaaa|aaa
    

    Since it doesn't match aaaaaaa, it must match aaa. We're done, aaa is the final result.

  • 2020-12-05 07:55

    After a lot of experimentation I think the PHP regex engine is broken. The exact same code under Perl works fine and matches all of your strings from beginning to end as I would expect.

    Recursive regexes are hard on the imagination, but it looks to me as if /a(?:(?R)|a?)a/ should match aaaa as an a..a pair containing a second a..a pair, after which a second recursion fails and the alternate /a?/ matches instead as a null string.

  • 2020-12-05 08:06

    Okay, I finally have it.

    I awarded the correct answer to ridgerunner as he put me on the path to the solution, but I also wanted to write a full answer to the specific question in case someone else wants to fully understand the example too.

    First the solution, then some notes.

    A. Solution

    Here is a summary of the steps followed by the engine. The steps should be read from top to bottom. They are not numbered. The recursion depth is shown in the left column, going up from zero to four and back down to zero. For convenience, the expression is shown at the top right. For ease of readability, the "a"s being matched are shown at their place in the string (which is shown at the very top).

            STRING    EXPRESSION
            a a a a   a(?:(?R)|a?)a
    
    Depth   Match     Token
        0   a         first a from depth 0. Next step in the expression: depth 1.
        1     a       first a from depth 1. Next step in the expression: depth 2. 
        2       a     first a from depth 2. Next step in the expression: depth 3.  
        3         a   first a from depth 3. Next step in the expression: depth 4.  
        4             depth 4 fails to match anything. Back to depth 3 @ alternation.
        3             depth 3 fails to match rest of expression, back to depth 2
        2       a a   depth 2 completes as a/empty/a, back to depth 1
        1     a[a a]  a/[depth 2]a fails to complete, discard depth 2, back to alternation
        1     a       first a from depth 1
        1     a a     a from alternation
        1     a a a   depth 1 completes, back to depth 0
        0   a[a a a]  depth 0 fails to complete, discard depth 1, back to alternation
        0   a         first a from depth 0
        0   a a       a from alternation
        0   a a a     expression ends with successful match   
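
The per-depth outcomes in this summary can be checked with a short simulation. This is a sketch in Python under the assumption that recursion is atomic, as described in the accepted answer; the function name trace and the log format are hypothetical. Each depth is logged when it finishes, so the deepest call appears first.

```python
from typing import List, Optional, Tuple

Log = List[Tuple[int, Optional[str]]]


def trace(subject: str, start: int = 0, depth: int = 0,
          log: Optional[Log] = None) -> Tuple[Optional[int], Log]:
    """Simulate a(?:(?R)|a?)a with atomic recursion, logging each depth."""
    if log is None:
        log = []

    def is_a(i: int) -> bool:
        return i < len(subject) and subject[i] == "a"

    end: Optional[int] = None
    if is_a(start):
        pos = start + 1
        inner, _ = trace(subject, pos, depth + 1, log)
        if inner is not None and is_a(inner):
            end = inner + 1          # a, (?R), a
        elif is_a(pos) and is_a(pos + 1):
            end = pos + 2            # a, a, a
        elif is_a(pos):
            end = pos + 1            # a, empty, a
    # Record what this depth finally matched (None means it failed).
    log.append((depth, None if end is None else subject[start:end]))
    return end, log


_, steps = trace("aaaa")
for depth, text in steps:
    print(depth, text)
```

For aaaa this logs failures at depths 4 and 3, then aa at depth 2, aaa at depth 1, and aaa at depth 0, in line with the table.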
    

    B. Notes

    1. The source of confusion


    Here is what was counter-intuitive about it for me.

    We are trying to match a a a a

    I assumed that depth 0 of the recursion would match as a - - a and that depth 1 would match as - a a -

    But in fact depth 1 first matches as - a a a

    So depth 0 has nowhere to go to finish the match:

    a [D1: a a a] 
    

    ...then what? We are out of characters but the expression is not over.

    So depth 1 is discarded. Note that depth 1 is not attempted again by giving back characters, which would lead us to a different depth 1 match of - a a -

    That's because recursive matches are atomic. Once a depth matches, it's all or nothing, you keep it all or you discard it all.

    Once depth 1 is discarded, depth 0 moves on to the other side of the alternation, and returns the match: a a a
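
To see how much the atomic rule matters, the two behaviors can be compared side by side. The following Python sketch is a simulation, not a call into a real engine; ends and first_match are hypothetical names, and the non-atomic mode is only an assumption about what backtracking into the recursion would look like (which is what the Perl result reported in this thread suggests). Real engines may differ by version.

```python
from typing import Iterator, Optional


def ends(subject: str, start: int, atomic: bool) -> Iterator[int]:
    """Yield candidate end offsets for a(?:(?R)|a?)a in the order tried."""
    def is_a(i: int) -> bool:
        return i < len(subject) and subject[i] == "a"

    if not is_a(start):
        return
    pos = start + 1
    # 1st alternative: (?R) followed by the closing a.
    for inner in ends(subject, pos, atomic):
        if is_a(inner):
            yield inner + 1
        if atomic:
            break        # atomic: never ask the recursion for another result
    # 2nd alternative: a? (greedy, then empty) followed by the closing a.
    if is_a(pos) and is_a(pos + 1):
        yield pos + 2
    if is_a(pos):
        yield pos + 1


def first_match(subject: str, atomic: bool) -> Optional[str]:
    end = next(ends(subject, 0, atomic), None)
    return None if end is None else subject[:end]


print(first_match("aaaa", atomic=True))    # aaa
print(first_match("aaaa", atomic=False))   # aaaa
```

With atomic=False, depth 1 is allowed to give back a character and match - a a -, which is exactly the match I had originally expected.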

    2. The source of clarity


    What helped me the most was the example that ridgerunner gave. In his example, he showed how to trace the path of the engine, which is exactly what I wanted to understand.

    Following this method, I traced the full path of the engine for our specific example. As I have it, the path is 25 steps long, so it is considerably longer than the summary above. But the summary is accurate to the path I traced.

    Big Thanks to everyone else who contributed, in particular Wiseguy for a very intriguing presentation. I still wonder if somehow I might be missing something and Wiseguy's answer might amount to the same!
