Question
I read http://swtch.com/~rsc/regexp/regexp1.html and in it the author says that in order to have backreferences in regexs, one needs backtracking when matching, and that makes the worst-case complexity exponential. But I don't see exactly why backreferences introduce the need for backtracking. Can someone explain why, and perhaps provide an example (regex and input)?
Answer 1:
To get directly at your question, you should make a short study of the Chomsky Hierarchy. This is an old and beautiful way of organizing formal languages into sets of increasing complexity. The lowest rung of the hierarchy is the Regular Languages. You might guess - and you'd be right - that the RLs are exactly those that can be represented with "pure" regular expressions: those with only the alphabet, the empty string, concatenation, alternation |, and Kleene star * (look Ma, no back references). A classic theorem of formal language theory - Kleene's Theorem - is that DFAs, NFAs (as described in the article you cited), and regular expressions all have exactly the same power to represent and recognize languages. Thompson's construction, given in the article, is part of the theorem's proof.
Every RL is also a CFL, but there are infinitely many CFLs that aren't regular. A feature that can exist in CFLs that makes them too complex to be regular is balanced pairs of things: parentheses, begin-end blocks, etc. Nearly all programming languages are CFLs. CFLs can be efficiently recognized by what's called a pushdown automaton. This is essentially an NFA with a stack glued on. The stack can grow as big as needed, so it's no longer a finite automaton. Parsers of real programming languages are nearly all variations on pushdown automata.
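To make the stack idea concrete, here is a minimal sketch (mine, not from the article) of a pushdown-style recognizer for the classic non-regular CFL of balanced parentheses; a single counter stands in for the PDA's stack:

def balanced(s: str) -> bool:
    # The counter plays the role of the pushdown automaton's stack.
    # No finite set of states can track unbounded nesting depth.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1          # push
        elif ch == ')':
            if depth == 0:      # pop from an empty stack: reject
                return False
            depth -= 1          # pop
    return depth == 0           # accept only if the stack is empty

print(balanced('(()())'))   # True
print(balanced('(()'))      # False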
Consider the regex with backreference
^(b*a)\1$
In words, this represents strings of length 2n for some n, where both the n'th and 2n'th characters are a and all other characters are b. This is a perfect example of a CFL that's not regular. You can rigorously prove this with another cool formal-language tool called the pumping lemma.
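You can check this behavior with any backtracking engine; a quick illustration using Python's re module (the example strings are mine, not the author's):

import re

pattern = re.compile(r'^(b*a)\1$')

# Matching strings have length 2n with 'a' exactly at positions n and 2n:
for s in ['aa', 'baba', 'bbabba', 'ba', 'abab']:
    print(repr(s), bool(pattern.match(s)))
# 'aa', 'baba', 'bbabba' -> True; 'ba', 'abab' -> False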
This is exactly why back references cause problems! They allow "regular expressions" that represent languages that aren't regular. Therefore there is no NFA or DFA that can ever recognize them.
But wait, it's even worse than I've made it out to be so far. Consider
^(b*a)\1\1$
We now have strings of length 3n where the n'th, 2n'th, and 3n'th characters are a and all others are b. There is another flavor of the pumping lemma that allows a proof that this language is too complex even to be a CFL! No pushdown automaton can recognize this one.
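The same quick check works for the three-copy version:

import re

pattern3 = re.compile(r'^(b*a)\1\1$')

# Length 3n, with 'a' at positions n, 2n, and 3n:
for s in ['aaa', 'bababa', 'bbabbabba', 'aabbb']:
    print(repr(s), bool(pattern3.match(s)))
# 'aaa', 'bababa', 'bbabbabba' -> True; 'aabbb' -> False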
Back references allow these supercharged regexes to represent languages three rungs up the Chomsky Hierarchy: the Context Sensitive Languages. Roughly speaking, the only way to recognize a CSL is to check the input against every string of equal length in the language (at least if P != NP, but that's true for all practical purposes and a different story altogether). The number of such strings is exponential in the length of the one you're matching.
This is why a searching, backtracking regex matcher is needed. You can be very clever in the way you design the search, but there will always be some input that drives it to take exponential time.
So I agree with the author of the paper you cited. It's possible to write perfectly innocent-looking regexes with no back refs that will be matched efficiently for nearly all inputs, but where some input exists that causes a Perl or Java or Python regex matcher - because it is a backtracking search - to require millions of years to complete the match. This is crazy. You can have a script that's correct and works fine for years and then locks up one day merely because it stumbled onto one of the bad inputs. Suppose the regex is buried in the message parser of the navigation system of the airplane you're riding...
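A small self-contained demonstration of that blow-up (the pattern (a+)+b is a well-known "catastrophic backtracking" example, not one from the cited paper): a backref-free regex run against inputs that force the match to fail.

import re
import time

# A backref-free pattern that forces exponential backtracking when the
# match must fail: every way of splitting the run of a's between the
# inner + and the outer + is explored before the engine gives up.
evil = re.compile(r'(a+)+b')

for n in range(20, 27, 2):
    t0 = time.perf_counter()
    evil.match('a' * n)            # input contains no 'b': guaranteed failure
    print(n, f'{time.perf_counter() - t0:.2f}s')

On CPython's backtracking engine each extra 'a' roughly doubles the running time (exact timings vary by machine; the last iteration may take many seconds). A Thompson-NFA engine such as RE2 or grep handles the same pattern in linear time.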
Edit
By request, I'll sketch how the pumping lemma can be used to prove the language a^k b a^k b is not regular. Here a^k is shorthand for a repeated k times. The PL says that there must exist a positive integer N such that every string in a regular language of length at least N can be written as R S T, where R S^k T is also in the language for every natural number k. Here R, S, and T are strings, S may not be empty, and (in the standard statement) R S together have length at most N.
Proof of the PL depends on the fact that every regular language corresponds to some DFA. An accepted input to this DFA longer than its number of states (which equates to N in the lemma) must cause it to "loop": to repeat a state. Call this state X. The machine consumes some string R to get from the start to X, then S to loop back to X, then T to get to an accepting state. Adding extra copies of S (or else deleting S) in the input corresponds only to a different number of "loops" from X back to X. Consequently, the new string with additional (or deleted) copies of S will also be accepted.
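Here is a toy illustration of that argument (my own, not from the answer), assuming a hand-built two-state DFA for the regular language a*b:

# Toy DFA for a*b: state 0 is the start, state 1 accepts.
# Missing transitions go to an implicit dead state.
delta = {(0, 'a'): 0, (0, 'b'): 1}

def accepts(s):
    state = 0
    for ch in s:
        if (state, ch) not in delta:
            return False               # dead: no transition
        state = delta[(state, ch)]
    return state == 1

# The accepted string 'aab' is longer than the number of states, so its
# run 0 -> 0 -> 0 -> 1 repeats state 0. That gives R = '', S = 'a',
# T = 'ab', and pumping S just loops from state 0 back to state 0:
for k in range(4):
    s = 'a' * k + 'ab'                 # R + S*k + T
    print(repr(s), accepts(s))         # all True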
Since every RL must satisfy the PL, a proof that a language is not regular proceeds by showing that it contradicts the PL. For our language, this is not hard. Suppose you are trying to convince me that the language L = a^k b a^k b satisfies the PL. Because it does, you must be able to give me some value of N (see above): the number of states in a hypothetical DFA that recognizes L. At that point I will say, "Okay, Mr. Regular Guy, consider the string B = a^N b a^N b." If L is regular, B must cause this DFA (no matter what it looks like) to loop within the first N characters, which are all a's! So the loop (string S above) consists entirely of a's, too. With this I can immediately show that your claim about L being regular is false: I just choose to go around the loop a second time. This causes your hypothetical DFA to accept a new string a^M b a^N b, where M > N because I added a's to its first half. Ouch! The new string is not in L, so the PL does not hold after all. Since I can play this trick no matter what N you provide, the PL cannot hold for L, and L cannot be regular after all.
Since it's not regular, Kleene's theorem tells us there is no DFA nor NFA nor "pure" regex that describes it.
The proof that back refs allow languages that aren't even context free has a very similar ring but needs background on pushdown automata that I'm not going to give here. Google will provide.
NB: Both of these fall short of a proof that back refs make recognition NP-complete. They merely say, in a very rigorous way, that back refs add real complexity to pure regular expressions: they allow languages that can't be recognized by any machine with finite memory, nor by any machine with only an unbounded LIFO (stack) memory. I will leave the NP-completeness proof to others.
Answer 2:
NFAs and DFAs are finite automata, a.k.a. finite-state machines, which are "abstract machine[s] that can be in one of a finite number of states"[1]. Note the finite number of states.
The fast NFA/DFA algorithms discussed in the linked article, Regular Expression Matching Can Be Simple And Fast, are fast precisely because they work with a finite number of states, independent of the input length.
Introducing backreferences makes the number of states (almost) "infinite" (in the worst case about 256^n, where n is the length of the input). The number of states grows because every possible value of every backreference becomes a distinct state of the automaton.
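A rough way to see the growth (an illustrative count of mine, not a formal construction): for a pattern like (.*)\1, an automaton simulating the match without backtracking would have to remember the captured text, and the number of distinct capture values grows exponentially with input length even over a 2-symbol alphabet:

from itertools import product

# Every prefix of the input is a possible value of the capture group
# in (.*)\1, so count the distinct capture values reachable across
# all inputs of length n over the alphabet {a, b}:
for n in range(1, 7):
    captures = {s[:i]
                for s in map(''.join, product('ab', repeat=n))
                for i in range(n + 1)}
    print(n, len(captures))    # 3, 7, 15, 31, 63, 127: about 2**(n+1)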
Thus a finite-state machine no longer suffices, and backtracking algorithms have to be used instead.
Answer 3:
There are some excellent examples in this tutorial:
http://www.regular-expressions.info/brackets.html
The particular case you'll be interested in is 'Backtracking Into Capturing Groups': it explains how a group's match can be given up and retried several times before one is found that lets the whole regex match. It's also worth noting that this can lead to unexpected matches.
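A small illustration of that effect, assuming Python's re module and a made-up comma-separated input:

import re

# One might expect each (.*?,) to match a single comma-free field, but
# the engine can backtrack into the group and let .*? swallow commas:
m = re.match(r'^(.*?,){2}X', 'a,b,c,X')
print(m.group(0))   # 'a,b,c,X'  -- the overall match succeeds anyway
print(m.group(1))   # 'b,c,'     -- the last repetition spans two fields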
Answer 4:
A very interesting paper: Extending Finite Automata to Efficiently Match Perl-Compatible Regular Expressions, which supports back-references and counted occurrences efficiently with a modified NFA.
Source: https://stackoverflow.com/questions/11101405/how-do-backreferences-in-regexes-make-backtracking-required