I found this excellent tutorial on regular expressions and while I intuitively understand what \"greedy\", \"reluctant\" and \"possessive\" quantifiers do, there seems to be
I haven't heard the exact terms 'regurgitate' or 'backing off' before; the phrase that would replace these is "backtracking", but 'regurgitate' seems like as good a phrase as any for "the content that had been tentatively accepted before backtracking threw it away again".
The important thing to realize about most regex engines is that they are backtracking: they will tentatively accept a potential, partial match, while trying to match the entire contents of the regex. If the regex cannot be completely matched at the first attempt, then the regex engine will backtrack on one of its matches. It will try matching *
, +
, ?
, alternation, or {n,m}
repetition differently, and try again. (And yes, this process can take a long time.)
The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed (by whom?).
The last three letters, f
, o
, and o
were already consumed by the initial .*
portion of the rule. However, the next element in the regex, f
, has nothing left in the input string. The engine will be forced to backtrack on its initial .*
match, and try matching all-but-the-last character. (It might be smart and backtrack to all-but-the-last-three, because it has three literal terms, but I'm unaware of implementation details at this level.)
So the matcher slowly backs off (from right-to-left?) one letter at a time until the rightmost occurrence of "foo" has been regurgitated (what does this mean?), at which
This means the foo
had tentatively been including when matching .*
. Because that attempt failed, the regex engine tries accepting one fewer character in .*
. If there had been a successful match before the .*
in this example, then the engine would probably try shortening the .*
match (from right-to-left, as you pointed out, because it is a greedy qualifier), and if it was unable to match the entire inputs, then it might be forced to re-evaluate what it had matched before the .*
in my hypothetical example.
point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming (by whom?) "nothing". Because "foo"
The initial nothing is consumed by .?*
, which will consume the shortest possible amount of anything that allows the rest of the regex to match.
doesn't appear at the beginning of the string, it's forced to swallow (who swallows?) the
Again the .?*
consumes the first character, after backtracking on the initial failure to match the entire regex with the shortest possible match. (In this case, the regex engine is extending the match for .*?
from left-to-right, because .*?
is reluctant.)
first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, (how?)
A .*+
will consume as much as possible, and will not backtrack to find new matches when the regex as a whole fails to find a match. Because the possessive form does not perform backtracking, you probably won't see many uses with .*+
, but rather with character classes or similar restrictions: account: [[:digit:]]*+ phone: [[:digit:]]*+
.
This can drastically speed up regex matching, because you're telling the regex engine that it should never backtrack over potential matches if an input doesn't match. (If you had to write all the matching code by hand, this would be similar to never using putc(3)
to 'push back' an input character. It would be very similar to the naive code one might write on a first try. Except regex engines are way better than a single character of push-back, they can rewind all the back to zero and try again. :)
But more than potential speed ups, this also can let you write regexs that match exactly what you need to match. I'm having trouble coming up with an easy example :) but writing a regex using possessive vs greedy quantifiers can give you different matches, and one or the other may be more appropriate.
leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off (what does back off mean?); it will outperform
"Backing off" in this context means "backtracking" -- throwing away a tentative partial match to try another partial match, which may or may not succeed.
the equivalent greedy quantifier in cases where the match is not immediately found.