Writing better regex expression for not using lazy repeat quantifier

风格不统一 提交于 2019-12-12 10:53:15

问题


I have a regular expression:

(<select([^>]*>))(.*?)(</select\s*>)

Since it uses lazy repeat quantifier, for longer strings(having options more than 500) it backtracks for more than 100,000 times and fails. Please help me to find a better regular expression which doesn't use lazy repeat quantifier


回答1:


<select[^>]*>[^<]*(?:<(?!/select>)[^<]*)*</select>

...or in human-readable form:

<select[^>]*>    # start tag
[^<]*            # anything except opening bracket
(?:              # if you find an open bracket
  <(?!/select>)  #   match it if it's not part of end tag
  [^<]*          #   consume any more non-brackets
)*               # repeat as needed
</select>        # end tag

This is an example of the "unrolled loop" technique Friedl develops in his book, Mastering Regular Expressions. I did a quick test in RegexBuddy using a pattern based on reluctant quantifiers:

(?s)<select[^>]*>.*?</select>

...and it took about 6,000 steps to find a match. The unrolled-loop pattern took only 500 steps. And when I removed the closing bracket from the end tag (</select), making a match impossible, it required only 800 steps to report failure.

If your regex flavor supports possessive quantifiers, go ahead and use them, too:

<select[^>]*+>[^<]*+(?:<(?!/select>)[^<]*+)*+</select>

It takes about the same number of steps to achieve a match, but it can use a lot less memory in the process. And if no match is possible, it fails even more quickly; in my tests it took about 500 steps, the same number it took to find a match.




回答2:


Unfortunately this wont work, see the answer by Alan Moore for a correct example!

(<select([^>]*>))(.*+)(</select\s*>)

From the perl regexp manpage:

By default, when a quantified subpattern does not allow the rest of the overall pattern to match, Perl will backtrack. However, this behaviour is sometimes undesirable. Thus Perl provides the "possessive" quantifier form as well.

       *+     Match 0 or more times and give nothing back
       ++     Match 1 or more times and give nothing back
       ?+     Match 0 or 1 time and give nothing back
       {n}+   Match exactly n times and give nothing back (redundant)
       {n,}+  Match at least n times and give nothing back
       {n,m}+ Match at least n but not more than m times and give nothing back

For instance,

      'aaaa' =~ /a++a/

will never match, as the "a++" will gobble up all the "a"'s in the string and won't leave any for the remaining part of the pattern. This feature can be extremely useful to give perl hints about where it shouldn't backtrack. For instance, the typical "match a double-quoted string" problem can be most efficiently performed when written as:

      /"(?:[^"\\]++|\\.)*+"/


来源:https://stackoverflow.com/questions/4077135/writing-better-regex-expression-for-not-using-lazy-repeat-quantifier

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!