I'm trying to use the Daring Fireball Regular Expression for matching URLs in Java, and I've found a URL that causes the evaluation to take forever. I've modified the or
The problem is here:
"(?:" + // One or more:
"[^\\s()<>]+" + // Run of non-space, non-()<>
"|" + // or
"\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]+\\)))*\\)" + // balanced parens, up to 2 levels
")+"
What you've got here is nested quantifiers. This plays havoc with any backtracking algorithm. As an example, consider the regex /^(a+)+$/ matched against the string
aaaaaaaaaab
On the first attempt, the inner quantifier matches all of the a's. The regex then fails at the b, so the inner quantifier backs off by one a. The outer quantifier then runs the group again, swallowing up that last a, and the regex fails once more. We basically get exponential behaviour as the quantifiers try every way of splitting up the run of a's without actually making any progress.
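To get a feel for how fast that blows up, here is a rough sketch that counts the ways a run of n a's can be cut into consecutive non-empty pieces, one piece per pass of the outer quantifier. The class and method names are made up for illustration, and this is only a model of the search space, not a trace of what any particular regex engine does.

```java
final class SplitCount {
    // Counts the ways to cut a run of n identical characters into
    // consecutive non-empty pieces (one piece per iteration of the outer +).
    static long countSplits(int n) {
        if (n == 0) {
            return 1; // nothing left to split: one complete arrangement
        }
        long total = 0;
        // The inner a+ takes `first` characters; the rest is split recursively.
        for (int first = 1; first <= n; first++) {
            total += countSplits(n - first);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " a's: " + countSplits(n) + " ways to split");
        }
    }
}
```

For the ten a's in the string above this comes to 512, and every extra a doubles it.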
The solution is possessive quantifiers, which we denote by tacking a + onto the end of a quantifier. We set up the inner quantifiers so that once they have a match, they don't let it go: they hold onto it until the match fails or an earlier quantifier backs off and they have to rematch starting somewhere else in the string. If we instead used /^(a++)+$/ as our regex, we would fail immediately on the non-matching string above, rather than going exponential trying to match it.
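A minimal sketch of the two regexes side by side in Java, to show that the possessive form gives the same answers on these inputs. The class, constant, and method names are hypothetical. The test string is kept short on purpose, because how slow the greedy form gets depends on the input length and on the regex engine version.

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    // Nested greedy quantifiers: prone to heavy backtracking on a failed match.
    static final Pattern GREEDY = Pattern.compile("^(a+)+$");
    // Same pattern with the inner quantifier made possessive.
    static final Pattern POSSESSIVE = Pattern.compile("^(a++)+$");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        String nonMatching = "aaaaaaaaaab"; // the string from the example above
        String matching = "aaaaaaaaaa";

        System.out.println("greedy, non-matching:     " + matches(GREEDY, nonMatching));
        System.out.println("possessive, non-matching: " + matches(POSSESSIVE, nonMatching));
        System.out.println("greedy, matching:         " + matches(GREEDY, matching));
        System.out.println("possessive, matching:     " + matches(POSSESSIVE, matching));
    }
}
```

Both patterns reject the first string and accept the second. The difference is only in how much work the failed match takes as the run of a's grows.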
Try making those inner quantifiers possessive and see if it helps.
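Applied to the fragment quoted above, that could look something like the sketch below. It only covers the fragment in isolation, and the class, constant, and method names are made up for the example. The rest of your pattern isn't reproduced here, so I haven't checked that the full regex still matches the same URLs: if something after this group relies on it giving characters back, a possessive run will no longer do that, and you may only be able to make some of the quantifiers possessive.

```java
import java.util.regex.Pattern;

final class PossessiveFragment {
    // The fragment from above with each inner quantifier made possessive.
    static final String FRAGMENT =
        "(?:" +                 // One or more:
        "[^\\s()<>]++" +        // Run of non-space, non-()<> (possessive)
        "|" +                   // or
        "\\((?:[^\\s()<>]++|(?:\\([^\\s()<>]++\\)))*+\\)" + // balanced parens, up to 2 levels
        ")+";

    static final Pattern PATTERN = Pattern.compile(FRAGMENT);

    // True when the whole input is matched by the fragment.
    static boolean matchesWhole(String input) {
        return PATTERN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("foo(bar)baz"));   // balanced, one level
        System.out.println(matchesWhole("foo(bar(baz))")); // balanced, two levels
        System.out.println(matchesWhole("foo(bar"));       // unbalanced
    }
}
```

An unbalanced input such as an opening paren followed by a long run of ordinary characters is the kind of case worth timing against your original pattern.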