Regex search with a pattern containing (?:.|\s)*? takes an increasingly long time

伪装坚强ぢ 2020-12-12 03:19

My regex takes longer with each match (about 30 seconds by the 5th one), but it needs to run for around 500 rounds of matching. I suspect catastrophic backtracking.

1 Answer
  • 2020-12-12 04:11

    The alternation (?:.|\\s)+? is very inefficient, as it involves too much backtracking.

    Basically, all variations of this pattern are extremely inefficient: (?:.|\s)*?, (?:.|\n)*?, (?:.|\r\n)*? and their greedy counterparts, too ((?:.|\s)*, (?:.|\n)*, (?:.|\r\n)*). (.|\s)*? is probably the worst of them all, since it also captures each character it matches.

    Why?

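    The two alternatives, . and \s, may match the same text at the same location: they both match regular spaces, at least. So whenever the rest of the pattern fails, the engine has a second way to match the same character and must try it as well. See this demo taking 3555 steps to complete and the .*? demo (with the s modifier) taking 1335 steps to complete.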
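The overlap between the two alternatives can be checked directly. This is only a minimal sketch: the class name OverlapDemo and its helper methods are made up for illustration, and it uses the default java.util.regex flags (no DOTALL).

```java
import java.util.regex.Pattern;

public class OverlapDemo {
    // True if the whole string is matched by a single '.' (default flags).
    static boolean dotMatches(String s) {
        return Pattern.matches(".", s);
    }

    // True if the whole string is matched by a single '\s'.
    static boolean whitespaceMatches(String s) {
        return Pattern.matches("\\s", s);
    }

    public static void main(String[] args) {
        // A regular space is matched by BOTH alternatives, so (?:.|\s)
        // has two ways to consume it and may retry the second on failure.
        System.out.println(dotMatches(" ") + " " + whitespaceMatches(" "));   // true true
        // A newline is matched only by \s when DOTALL is off.
        System.out.println(dotMatches("\n") + " " + whitespaceMatches("\n")); // false true
    }
}
```

Every space or tab in the input is therefore a position where the engine keeps an extra alternative to try, which is where the additional steps come from.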

    Patterns like (?:.|\n)*? / (?:.|\n)* often cause a StackOverflowError in Java. The main problem is the alternation, which already causes backtracking on its own, matching one char at a time, with the whole group then repeated by a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended (only in the ElasticSearch Lucene regex engine is (.|\n) the only way to match any char).

    Solution

    If you want to match any characters including whitespace with regex, do it with

    [\\s\\S]*?
    

    Or enable singleline mode with the inline (?s) modifier (or the Pattern.DOTALL flag passed to Pattern.compile) and just use . (e.g. (?s)start(.*?)end).
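The three spellings of the fix can be compared side by side. This is a sketch, not the asker's actual code: the class name AnyCharMatch, the firstGroup helper, and the start/end delimiters and sample text are placeholders for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnyCharMatch {
    // Option 1: a character class that matches any char, no flag needed.
    static final Pattern CHAR_CLASS = Pattern.compile("start([\\s\\S]*?)end");
    // Option 2: the DOTALL flag makes '.' match line terminators too.
    static final Pattern DOTALL_FLAG = Pattern.compile("start(.*?)end", Pattern.DOTALL);
    // Option 3: the same flag, written inline as (?s).
    static final Pattern INLINE_FLAG = Pattern.compile("(?s)start(.*?)end");

    // Returns capture group 1 of the first match, or null if nothing matches.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "start line one\nline two end";
        // All three patterns capture the same text spanning the line break.
        System.out.println(firstGroup(CHAR_CLASS, text));
        System.out.println(firstGroup(DOTALL_FLAG, text));
        System.out.println(firstGroup(INLINE_FLAG, text));
    }
}
```

Compiling each Pattern once and reusing it, as the static fields do here, also avoids recompiling the regex on each of the roughly 500 rounds.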

    NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.
