Fixing Catastrophic Backtracking in Regular Expression

Submitted by 眉间皱痕 on 2019-12-28 03:05:19

Question


The Problem

I'm using the following regular expression to check for valid file paths:

^(?:[a-zA-Z]\:\\|\\\\)([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})+$

The test string V:\Sample Names\Libraries\DeveloperLib\DeveloperComDlgs\res is recognized as valid. I can even add invalid characters to the beginning of the string without issues. However, when I add an invalid character towards the end of the string, the webpage freezes because of catastrophic backtracking.

What is causing this in this regex string?


Breaking Down the Regex

Full string: ^(?:[a-zA-Z]\:\\|\\\\)([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})+$

First Group: (?:[a-zA-Z]\:\\|\\\\)

  • Checks for either
    • A capital or lowercase alphabetical letter followed by a colon and a backslash
    • A double backslash

Second Group: ([^\\\/\:\*\?\<\>\"\|]+(\\){0,1})

  • First Part: [^\\\/\:\*\?\<\>\"\|]+
    • Makes sure there are no illegal characters ( \ / : * ? < > " | )
  • Second Part: (\\){0,1}
    • Optionally matches a single backslash after a section (zero or one time per repetition of the group)

I think it may be the {0,1} causing the issue, since this allows for backtracking, but I am not sure. Any thoughts?


Answer 1:


Your current regex can be written as ^(?:[a-zA-Z]:\\|\\\\)([^\\\/\:*?<>"|]+\\?)+$. Pay attention to the ? quantifier (it is equivalent to the {0,1} limiting quantifier) after \\ inside a group that is itself quantified with +.

Once a construct like (a+b?)+ is present inside a pattern, there is a high chance of catastrophic backtracking. Everything is fine when there is a match: c:\12\34\aaaaaaaaaaaaaaaaaaa, for example, is matched without trouble. But once a disallowed character appears and causes a non-match (try adding * at the end: c:\12\34\aaaaaaaaaaaaaaaaaaa*), the issue shows up.
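To get a feel for why (a+b?)+ explodes, note that b? may match nothing, so a run of n a's can be divided between iterations of the outer group in many different ways, and a backtracking engine may have to try them all before it can report a failure. The sketch below only counts those divisions; it is an illustration of the growth rate, not a model of any particular engine, and count_splits is a name invented for this example.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n identical chars into non-empty chunks.

    Each chunk stands for one iteration of the outer group in (a+b?)+
    when the optional b matches nothing.
    """
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) way
    # The first chunk takes k chars; the remainder is cut recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


if __name__ == "__main__":
    for n in (5, 10, 20, 30):
        print(n, count_splits(n))
```

The count doubles with every extra character (it equals 2 to the power n - 1), which is why a run of only a few dozen characters before the invalid one can be enough to freeze a page.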

To solve this, make sure that quantified subpatterns that can match the same text never follow one another in immediate succession. Using optional groups in which every subpattern is obligatory achieves this.

In most scenarios, you need to replace these pattern parts with the unrolled form a+(ba+)*: 1 or more occurrences of a, followed by 0 or more sequences of b (which is no longer optional by itself) and then 1 or more occurrences of a. This way, between one run of a and the next there must be a b. If you need to match an optional \ at the end (since ^(a+b?)+$ may actually match b at the end of the string), add a b? at the end: a+(ba+)*b?.
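As a quick sanity check of the unrolling, the following sketch compares the nested and unrolled forms on a handful of short strings using Python's re module. The sample list and the agree helper are made up for this illustration, and the inputs are kept short on purpose so the nested pattern cannot blow up.

```python
import re

NESTED = re.compile(r'^(a+b?)+$')         # the problematic shape
UNROLLED = re.compile(r'^a+(?:ba+)*b?$')  # unrolled form with optional trailing b

# Short inputs only: the nested form is exponential on long failing inputs.
SAMPLES = ['a', 'ab', 'aba', 'aab', 'abab', 'b', 'abb', 'ba', '', 'aac']


def agree(text: str) -> bool:
    """Return True when both patterns accept or both reject the text."""
    return bool(NESTED.match(text)) == bool(UNROLLED.match(text))


if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(sample), bool(UNROLLED.match(sample)), agree(sample))
```

Agreement on a few samples is of course not a proof of equivalence, but it helps catch mistakes when you rewrite a pattern by hand.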

So, translating this to your current scenario:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*$

or if the \ is allowed at the end:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*\\?$
                     |      a+       (   b       a+       )* b?

This version fails gracefully when there is no match, and still matches valid paths as expected.
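Here is a minimal sketch of the unrolled path pattern in use, written in Python for illustration (the question concerns a web page, but the pattern itself is unchanged). SEGMENT, FIXED and is_valid_path are names chosen for this example only.

```python
import re

# One path segment: one or more characters that are legal in a name.
SEGMENT = r'[^\\\/\:*?<>"|]+'

# Drive or UNC prefix, a segment, then (backslash + segment) repeated,
# and finally an optional trailing backslash.
FIXED = re.compile(rf'^(?:[a-zA-Z]:\\|\\\\){SEGMENT}(?:\\{SEGMENT})*\\?$')


def is_valid_path(path: str) -> bool:
    """Return True when the whole string matches the unrolled pattern."""
    return FIXED.match(path) is not None


if __name__ == "__main__":
    good = r'V:\Sample Names\Libraries\DeveloperLib\DeveloperComDlgs\res'
    bad = 'c:\\12\\34\\' + 'a' * 5000 + '*'  # invalid char at the very end
    print(is_valid_path(good))
    print(is_valid_path(bad))
```

The long failing input is the case that hung the original pattern; with the unrolled form there is only one way to divide the text into segments, so the engine has very little to retry.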

As @anubhava suggests, you can further improve performance by using possessive quantifiers (or atomic groups instead, since the .NET regex engine, for example, does not support possessive quantifiers), which disallow any backtracking into the grouped patterns. Once matched, these patterns are not retried, so a failure may come much more quickly:

^(?:[a-zA-Z]:\\|\\\\)[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*+\\?$
                                                            ^

or an atomic group example:

^(?:[a-zA-Z]:\\|\\\\)(?>[^\\\/\:*?<>"|]+(?:\\[^\\\/\:*?<>"|]+)*)\\?$
                     ^^^                                       ^                          
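Some engines offer neither possessive quantifiers nor atomic groups; as far as I know, JavaScript's built-in RegExp is one of them. A commonly used substitute, which is a different technique from the two patterns above, is to capture inside a lookahead and then consume the capture with a backreference: (?=(...))\1. This relies on lookaheads being atomic, which holds in the engines I am aware of. The sketch below uses Python's re module, and the names BODY and EMULATED_ATOMIC are invented for this example.

```python
import re

SEGMENT = r'[^\\\/\:*?<>"|]+'
BODY = rf'{SEGMENT}(?:\\{SEGMENT})*'  # segment, then (backslash + segment)*

# (?=(BODY)) captures the body without consuming it and is never re-entered;
# \1 then consumes exactly the captured text.
EMULATED_ATOMIC = re.compile(rf'^(?:[a-zA-Z]:\\|\\\\)(?=({BODY}))\1\\?$')


def is_valid_path(path: str) -> bool:
    """Return True when the string matches the lookahead-based pattern."""
    return EMULATED_ATOMIC.match(path) is not None


if __name__ == "__main__":
    print(is_valid_path(r'V:\Sample Names\Libraries\res'))
    print(is_valid_path('c:\\12\\34\\' + 'a' * 5000 + '*'))
```

Note that this adds a capturing group, so any existing group numbers in a larger pattern would shift.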

Note that : is not a special regex metacharacter and does not need to be escaped. Inside a character class, only -, ^, \ and ] usually require escaping; the other characters are not special there either.
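To confirm that the extra escapes change nothing, this small sketch compares the fully escaped character class from the question with a minimally escaped one over the printable ASCII range, using Python's re module. The names ESCAPED, MINIMAL and classes_agree are made up for this check, and other regex flavours may treat some escapes differently.

```python
import re

ESCAPED = re.compile(r'[^\\\/\:\*\?\<\>\"\|]')  # class as written in the question
MINIMAL = re.compile(r'[^\\/:*?<>"|]')          # only the backslash is escaped


def classes_agree() -> bool:
    """Check both classes accept and reject the same printable ASCII chars."""
    for code in range(32, 127):
        ch = chr(code)
        if bool(ESCAPED.fullmatch(ch)) != bool(MINIMAL.fullmatch(ch)):
            return False
    return True


if __name__ == "__main__":
    print(classes_agree())
```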

See more about catastrophic backtracking at The Explosive Quantifier Trap.



Source: https://stackoverflow.com/questions/45463148/fixing-catastrophic-backtracking-in-regular-expression
