EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the s
Excellent (and difficult) question!
First, with the PCRE regex engine, the (?R) behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole (?R) expression, and can give it back and try the other alternative to achieve an overall match.
To describe what is happening, let's change your example slightly so that it is easier to talk about and keep track of what is being matched at each step. Instead of aaaa as the subject text, let's use abcd. And let's change the regex from '#a(?:(?R)|a?)a#' to '#.(?:(?R)|.?).#'. The regex engine's matching behavior is the same.
Matching /.(?:(?R)|.?)./ to "abcd":
answer = r'''
Step Depth Regex Subject Comment
1 0 .(?:(?R)|.?). abcd Dot matches "a". Advance pointers.
^ ^
2 0 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 1).
^ ^
3 1 .(?:(?R)|.?). abcd Dot matches "b". Advance pointers.
^ ^
4 1 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 2).
^ ^
5 2 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers.
^ ^
6 2 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 3).
^ ^
7 3 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers.
^ ^
8 3 .(?:(?R)|.?). abcd Try 1st alt. Recursive call (to depth 4).
^ ^
9 4 .(?:(?R)|.?). abcd Dot fails to match end of string.
^ ^ DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
Give back text consumed by depth 4 (?R) = ""
10 3 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches EOS.
^ ^ Advance regex pointer.
11 3 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ DEPTH 3 (?R) FAILS. Return to step 6 depth 2
Give back text consumed by depth 3 (?R) = "d"
12 2 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "d".
^ ^ Advance pointers.
13 2 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to step 12 depth 2
14 2 .(?:(?R)|.?). abcd Match zero "d" (give it back).
^ ^ Advance regex pointer.
15 2 .(?:(?R)|.?). abcd Dot matches "d". Advance pointers.
^ ^ DEPTH 2 (?R) SUCCEEDS.
Return to step 4 depth 1
16 1 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to try other alternative. Give back
text consumed by depth 2 (?R) = "cd"
17 1 .(?:(?R)|.?). abcd Optional dot matches "c". Advance pointers.
^ ^
18 1 .(?:(?R)|.?). abcd Required dot matches "d". Advance pointers.
^ ^ DEPTH 1 (?R) SUCCEEDS.
Return to step 2 depth 0
19 0 .(?:(?R)|.?). abcd Required dot fails to match end of string.
^ ^ Backtrack to try other alternative. Give back
text consumed by depth 1 (?R) = "bcd"
20 0 .(?:(?R)|.?). abcd Try 2nd alt. Optional dot matches "b".
^ ^ Advance pointers.
21 0 .(?:(?R)|.?). abcd Dot matches "c". Advance pointers.
^ ^ SUCCESSFUL MATCH of "abc"
'''
There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question). A similar (albeit much longer) sequence of steps can be made for the other, longer result string in question.
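The atomic behavior described above can be modeled in a few lines. This is a minimal Python sketch, not PCRE itself: the function name match_end is made up for illustration, the pattern a(?:(?R)|a?)a is hard-coded, and the recursive call is assumed to return exactly one result that can never be reopened, which is the behavior this answer attributes to PCRE.

```python
def match_end(s, pos=0):
    """Model of a(?:(?R)|a?)a anchored at pos, with (?R) treated as atomic.

    Returns the end index of the match, or None if there is no match.
    """
    n = len(s)
    if pos >= n or s[pos] != "a":
        return None                      # the leading literal a fails
    p = pos + 1
    # 1st alternative: (?R). Atomic: one answer, which cannot be reopened.
    inner = match_end(s, p)
    if inner is not None and inner < n and s[inner] == "a":
        return inner + 1                 # the trailing literal a matched
    # 2nd alternative: a? taken greedily, followed by the trailing a.
    if p + 1 < n and s[p] == "a" and s[p + 1] == "a":
        return p + 2
    # a? gives its character back (matches empty); the trailing a matches at p.
    if p < n and s[p] == "a":
        return p + 1
    return None


for subject in ("aaa", "aaaa", "aaaaa", "aaaaaa"):
    end = match_end(subject)
    print(subject, "->", subject[:end])
```

Under these assumptions the sketch reports aaa for the subject aaaa, in line with the trace above.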
IMPORTANT: This describes recursive regex in PHP (which uses the PCRE library). Recursive regex works a bit differently in Perl itself.
Note: This is explained in the order you can conceptualize it. The regex engine does it backward of this; it dives down to the base case and works its way back.
Since your outer a's are explicitly there, it will match an a between two a's, or a previous recursion's match of the entire pattern between two a's. As a result, it will only match odd numbers of a's (middle one plus multiples of two).
At a length of three, aaa is the current recursion's matching pattern, so on the fourth recursion it's looking for an a between two a's (i.e., aaa) or the previous recursion's matched pattern between two a's (i.e., a+aaa+a). Obviously it can't match five a's when the string isn't that long, so the longest match it can make is three.
Similar deal with a length of six, as it can only match the "default" aaa or the previous recursion's match surrounded by a's (i.e., a+aaaaa+a).
However, it does not match all odd lengths.
Since you're matching recursively, you can only match the literal aaa or a+(prev recurs match)+a. Each successive match will therefore always be two a's longer than the previous match, or it will punt and fall back to aaa.
At a length of seven (matching against aaaaaaa), the previous recursion's match was the fallback aaa. So this time, even though there are seven a's, it will only match three (aaa) or five (a+aaa+a).
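The rule described so far (two a's longer than the previous match, or punt to aaa) can be written as a small recurrence. This is only a sketch of that rule in Python; match_lengths is a hypothetical helper, and the two seed values for input lengths 1 and 2 are assumptions taken from the observed output (no match, then aa) rather than derived from the rule.

```python
def match_lengths(max_len):
    """Match length for each input length 1..max_len (max_len >= 2)."""
    # Seeds taken from the observed output: length 1 has no match, length 2 matches "aa".
    lengths = {1: 0, 2: 2}
    for n in range(3, max_len + 1):
        candidate = lengths[n - 1] + 2          # a + (previous match) + a
        # If that no longer fits in the input, punt and fall back to aaa.
        lengths[n] = candidate if candidate <= n else 3
    return lengths


for n, m in match_lengths(10).items():
    print(n, "a" * m if m else "no match")
```

For an input of seven a's the recurrence gives five, because the previous length (six) had fallen back to three.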
When looping to longer lengths (80 in this example), look at the pattern (showing only the match, not the input):
no match
aa
aaa
aaa
aaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
aaaaa
aaaaaaa
aaaaaaaaa
aaaaaaaaaaa
aaaaaaaaaaaaa
aaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaa
What's going on here? Well, I'll tell you! :-)
When a recursive match would be one character longer than the input string, it punts back to aaa, as we've seen. In every iteration after that, the pattern starts over, matching two more characters than the previous match. Every iteration, the length of the input increases by one, but the length of the match increases by two. When the match size finally catches back up and surpasses the length of the input string, it punts back to aaa. And so on.
Alternatively viewed, here we can see how many characters longer the input is compared to the match length in each iteration:
(input len.) - (match len.) = (difference)
1 - 0 = 1
2 - 2 = 0
3 - 3 = 0
4 - 3 = 1
5 - 5 = 0
6 - 3 = 3
7 - 5 = 2
8 - 7 = 1
9 - 9 = 0
10 - 3 = 7
11 - 5 = 6
12 - 7 = 5
13 - 9 = 4
14 - 11 = 3
15 - 13 = 2
16 - 15 = 1
17 - 17 = 0
18 - 3 = 15
19 - 5 = 14
20 - 7 = 13
21 - 9 = 12
22 - 11 = 11
23 - 13 = 10
24 - 15 = 9
25 - 17 = 8
26 - 19 = 7
27 - 21 = 6
28 - 23 = 5
29 - 25 = 4
30 - 27 = 3
31 - 29 = 2
32 - 31 = 1
33 - 33 = 0
34 - 3 = 31
35 - 5 = 30
36 - 7 = 29
37 - 9 = 28
38 - 11 = 27
39 - 13 = 26
40 - 15 = 25
41 - 17 = 24
42 - 19 = 23
43 - 21 = 22
44 - 23 = 21
45 - 25 = 20
46 - 27 = 19
47 - 29 = 18
48 - 31 = 17
49 - 33 = 16
50 - 35 = 15
51 - 37 = 14
52 - 39 = 13
53 - 41 = 12
54 - 43 = 11
55 - 45 = 10
56 - 47 = 9
57 - 49 = 8
58 - 51 = 7
59 - 53 = 6
60 - 55 = 5
61 - 57 = 4
62 - 59 = 3
63 - 61 = 2
64 - 63 = 1
65 - 65 = 0
66 - 3 = 63
67 - 5 = 62
68 - 7 = 61
69 - 9 = 60
70 - 11 = 59
71 - 13 = 58
72 - 15 = 57
73 - 17 = 56
74 - 19 = 55
75 - 21 = 54
76 - 23 = 53
77 - 25 = 52
78 - 27 = 51
79 - 29 = 50
80 - 31 = 49
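For reasons that should now make sense, this is tied to powers of 2: going by the table above, the match covers the whole input at lengths 2, 3, 5, 9, 17, 33 and 65 (one more than a power of 2), and at the next length the match drops back to aaa.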
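The difference column can be regenerated and checked mechanically. The following Python sketch assumes the same "previous match plus two, else aaa" rule, seeded with the first two observed rows; the variable names are illustrative only.

```python
# Match length per input length, seeded with the first two observed rows.
match = {1: 0, 2: 2}
for n in range(3, 81):
    grown = match[n - 1] + 2
    match[n] = grown if grown <= n else 3   # fall back to aaa when it no longer fits

# (input len.) - (match len.) for every row of the table.
difference = {n: n - match[n] for n in range(1, 81)}

# Input lengths where the match covers the whole input (difference of zero).
full = [n for n in range(1, 81) if difference[n] == 0]
print(full)

# Each of those lengths is one more than a power of two.
print(all((n - 1) & (n - 2) == 0 for n in full))
```

The bit trick `(n - 1) & (n - 2) == 0` is just a compact test that n - 1 is a power of two.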
I've slightly simplified the original pattern for this example. Remember this. We will come back to it.
a((?R)|a)a
What the author Jeffrey Friedl means by "the (?R) construct makes a recursive reference to the entire regular expression" is that the regex engine will substitute the entire pattern in place of (?R) as many times as possible.
a((?R)|a)a # this
a((a((?R)|a)a)|a)a # becomes this
a((a((a((?R)|a)a)|a)a)|a)a # becomes this
# and so on...
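The expansion above is purely textual, so it can be reproduced with string substitution. This Python sketch only illustrates that mental picture (the engine does not literally rewrite the pattern); expand is a hypothetical helper name.

```python
PATTERN = "a((?R)|a)a"


def expand(pattern, times):
    """Textually substitute the whole pattern, in parentheses, for (?R)."""
    result = pattern
    for _ in range(times):
        # Each intermediate result contains exactly one (?R), so one spot is replaced.
        result = result.replace("(?R)", "(" + pattern + ")")
    return result


for depth in range(3):
    print(expand(PATTERN, depth))
```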
When tracing this by hand, you could work from the inside out. In (?R)|a, a is your base case. So we'll start with that.
a(a)a
If that matches the input string, take that match (aaa) back to the original expression and put it in place of (?R).
a(aaa|a)a
If the input string is matched with our recursive value, substitute that match (aaaaa) back into the original expression to recurse again.
a(aaaaa|a)a
Repeat until you can't match your input using the result of the previous recursion.
Example
Input: aaaaaa
Regex: a((?R)|a)a
Start at the base case, aaa.
Does the input match with this value? Yes: aaa
Recurse by putting aaa in the original expression:
a(aaa|a)a
Does the input match with our recursive value? Yes: aaaaa
Recurse by putting aaaaa in the original expression:
a(aaaaa|a)a
Does the input match with our recursive value? No: aaaaaaa
Then we stop here. The above expression could be rewritten (for simplicity) as:
aaaaaaa|aaa
Since it doesn't match aaaaaaa, it must match aaa. We're done; aaa is the final result.
After a lot of experimentation I think the PHP regex engine is broken. The exact same code under Perl works fine and matches all of your strings from beginning to end as I would expect.
Recursive regexes are hard on the imagination, but it looks to me as if /a(?:(?R)|a?)a/ should match aaaa as an a..a pair containing a second a..a pair, after which a second recursion fails and the alternate /a?/ matches instead as a null string.
Okay, I finally have it.
I awarded the correct answer to ridgerunner as he put me on the path to the solution, but I also wanted to write a full answer to the specific question in case someone else wants to fully understand the example too.
First the solution, then some notes.
Here is a summary of the steps followed by the engine. The steps should be read from top to bottom. They are not numbered. The recursion depth is shown in the left column, going up from zero to four and back down to zero. For convenience, the expression is shown at the top right. For ease of readability, the "a"s being matched are shown at their place in the string (which is shown at the very top).
STRING            EXPRESSION
a a a a           a(?:(?R)|a?)a
Depth  Match      Token
0      a          first a from depth 0. Next step in the expression: depth 1.
1        a        first a from depth 1. Next step in the expression: depth 2.
2          a      first a from depth 2. Next step in the expression: depth 3.
3            a    first a from depth 3. Next step in the expression: depth 4.
4                 depth 4 fails to match anything. Back to depth 3 @ alternation.
3                 depth 3 fails to match rest of expression, back to depth 2
2          a a    depth 2 completes as a/empty/a, back to depth 1
1        a[a a]   a/[depth 2]a fails to complete, discard depth 2, back to alternation
1        a        first a from depth 1
1        a a      a from alternation
1        a a a    depth 1 completes, back to depth 0
0      a[a a a]   depth 0 fails to complete, discard depth 1, back to alternation
0      a          first a from depth 0
0      a a        a from alternation
0      a a a      expression ends with successful match
1. The source of confusion
Here is what was counter-intuitive about it for me.
We are trying to match a a a a
I assumed that depth 0 of the recursion would match as a - - a and that depth 1 would match as - a a -
But in fact depth 1 first matches as - a a a
So depth 0 has nowhere to go to finish the match:
a [D1: a a a]
...then what? We are out of characters but the expression is not over.
So depth 1 is discarded. Note that depth 1 is not attempted again by giving back characters, which would lead us to a different depth 1 match of - a a -
That's because recursive matches are atomic. Once a depth matches, it's all or nothing, you keep it all or you discard it all.
Once depth 1 is discarded, depth 0 moves on to the other side of the alternation, and returns the match: a a a
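To see how much the "all or nothing" rule matters, here is a rough Python model that can run the same pattern both ways. It is a sketch under stated assumptions, not a reimplementation of PCRE or Perl: ends and first_match are made-up names, the pattern a(?:(?R)|a?)a is hard-coded, and the non-atomic mode simply lets the outer level ask the recursive call for its next possible match.

```python
def ends(s, pos, atomic):
    """Yield, in preference order, each end index where a(?:(?R)|a?)a matches from pos."""
    n = len(s)
    if pos >= n or s[pos] != "a":
        return
    p = pos + 1
    # 1st alternative: the recursive call.
    inner = ends(s, p, atomic)
    if atomic:
        # Atomic: keep only the first result; it cannot be backtracked into.
        first = next(inner, None)
        inner = iter(() if first is None else (first,))
    for e in inner:
        if e < n and s[e] == "a":
            yield e + 1
    # 2nd alternative: a? taken greedily, then the trailing a.
    if p + 1 < n and s[p] == "a" and s[p + 1] == "a":
        yield p + 2
    # a? matching empty, then the trailing a.
    if p < n and s[p] == "a":
        yield p + 1


def first_match(s, atomic):
    end = next(ends(s, 0, atomic), None)
    return None if end is None else s[:end]


print(first_match("aaaa", atomic=True))
print(first_match("aaaa", atomic=False))
```

In the atomic mode depth 1 offers only a a a and is then discarded, so depth 0 ends up with a a a; in the non-atomic mode depth 1 is allowed to shrink to a a, which lets depth 0 finish the match over the whole string.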
2. The source of clarity
What helped me the most was the example that ridgerunner gave. In his example, he showed how to trace the path of the engine, which is exactly what I wanted to understand.
Following this method, I traced the full path of the engine for our specific example. As I have it, the path is 25 steps long, so it is considerably longer than the summary above. But the summary is accurate to the path I traced.
Big Thanks to everyone else who contributed, in particular Wiseguy for a very intriguing presentation. I still wonder if somehow I might be missing something and Wiseguy's answer might amount to the same!