StackOverflowError when matching large input using RegEx

喜欢而已 posted on 2019-11-29 11:45:53

Before fixing the problem with StackOverflowError...

  1. I would like to point out that your current regex (\d\*?(;(?=\d))?)+ fails to enforce this condition:

    Each value could include one * at the end (used as wildcard for other matching)

    It fails to reject the case 23*4*4*;34*434*34, as seen here (see footnote 1).

  2. Your regex will do unnecessary backtracking on a non-matching input.

  3. Java consumes stack space for each repetition of the group (\d\*?(;(?=\d))?) (which the + repeats one or more times), so a long input can exhaust the stack.
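Point 1 can be checked directly in Java. The following is a minimal sketch; the class and method names are made up for illustration, and only the regex and the sample input come from the discussion above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate point 1.
class OriginalRegexCheck {
    // The regex from the question, written as a Java string literal.
    static final Pattern ORIGINAL = Pattern.compile("(\\d\\*?(;(?=\\d))?)+");

    // True if the whole input matches the original regex.
    static boolean matchesOriginal(String input) {
        return ORIGINAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Each digit is its own repetition of the group, so "3*", "4*", "4*"
        // are all accepted even though the value 23*4*4* contains three stars.
        System.out.println(matchesOriginal("23*4*4*;34*434*34")); // prints true
    }
}
```

Because the group contains a single \d, the regex treats every digit as a separate "value", which is why several * characters inside one value slip through.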

A correct regex would be:

\d+\*?(?:;\d+\*?)*

Note that this will reject a lone *; your requirement does not make clear whether that input should be accepted or rejected.

On its own, this doesn't fix the StackOverflowError, since each repetition of the group (?:;\d+\*?) also uses up stack. To fix that, make all quantifiers possessive; no backtracking is needed because the grammar is not ambiguous:

\d++\*?+(?:;\d++\*?+)*+

Written as a Java string literal:

"\\d++\\*?+(?:;\\d++\\*?+)*+"
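A minimal sketch of how the possessive regex could be wired into a validator. The class name, method name and sample inputs are assumptions made for illustration; only the regex literal is taken from the answer.

```java
import java.util.regex.Pattern;

// Hypothetical validator built around the possessive regex above.
class PossessiveTokenRegex {
    // Compile once and reuse; every quantifier is possessive.
    static final Pattern TOKENS = Pattern.compile("\\d++\\*?+(?:;\\d++\\*?+)*+");

    // True if the whole input is a ;-separated list of digits with an optional trailing *.
    static boolean isValid(String input) {
        return TOKENS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("12*;34;5*"));         // prints true
        System.out.println(isValid("23*4*4*;34*434*34")); // prints false
        System.out.println(isValid("*"));                 // prints false
    }
}
```

matches() requires the entire input to match, so no ^ or $ anchors are needed in the pattern.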

I have tested the regex above with both matching and non-matching inputs containing more than 3600 tokens (separated by ;).
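A test of that kind could look roughly like the sketch below. The token text "12345*" and the way the non-matching input is built are assumptions; the answer only states the token count and the separator.

```java
import java.util.Collections;
import java.util.regex.Pattern;

// Hypothetical harness for trying the possessive regex on a large input.
class LargeInputCheck {
    static final Pattern TOKENS = Pattern.compile("\\d++\\*?+(?:;\\d++\\*?+)*+");

    // Builds tokenCount copies of an arbitrary valid token, joined by ';'.
    static String buildInput(int tokenCount) {
        return String.join(";", Collections.nCopies(tokenCount, "12345*"));
    }

    public static void main(String[] args) {
        String valid = buildInput(3600);
        String invalid = valid + ";"; // a trailing separator makes the input invalid

        System.out.println("valid input matches:   " + TOKENS.matcher(valid).matches());
        System.out.println("invalid input matches: " + TOKENS.matcher(invalid).matches());
    }
}
```

Swapping in one of the non-possessive regexes and raising the token count is a simple way to see where the stack limit lies on a particular JVM.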

Footnote

1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.

Appendix

  • Actually, from my testing with your regex (\d\*?(;(?=\d))?)+ (which is incorrect according to your requirement), making the outermost + possessive (++) seems to fix the StackOverflowError problem, at least in my testing with around 3600 tokens (separated by ;, the string is around 20k characters long). It also doesn't seem to cause long execution time when testing against a non-matching string.

  • In my solution, making the * quantifier for the group (?:;\d+\*?) possessive is enough to resolve the StackOverflowError.

    "\\d+\\*?(?:;\\d+\\*?)*+"
    

    However, I make everything possessive to be on the safe side.

Your regexp is a bit inefficient and does not match your description. You have '\d\*?', which is one digit followed by an optional *, and then an optional ';(?=\d)', which is a ';' with a lookahead for a digit. The string '1*2*3*' matches your regexp but not your description. You could use the following regexp instead. It matches your input and is a bit more efficient.

final String TEST_REGEX = "(\\d+\\*?)(?:;\\d+\\*?)+";

It passes the test when count < 300 but still fails for larger values. For large inputs, use plain string operations such as indexOf and substring to verify the input instead.
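A minimal sketch of such a regex-free check, assuming the intended grammar is one or more ;-separated values, each consisting of ASCII digits with an optional single * at the end. The class and method names are hypothetical.

```java
// Hypothetical validator that uses only indexOf, substring and charAt.
class PlainTokenValidator {
    // A token is valid if it is one or more ASCII digits, optionally followed by one '*'.
    static boolean isValidToken(String token) {
        int end = token.length();
        if (end > 0 && token.charAt(end - 1) == '*') {
            end--; // ignore the single trailing wildcard
        }
        if (end == 0) {
            return false; // empty token, or a lone '*'
        }
        for (int i = 0; i < end; i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false; // anything else, including a second '*', is rejected
            }
        }
        return true;
    }

    // Walks the input separator by separator; uses a loop, so stack depth stays constant.
    static boolean isValid(String input) {
        int start = 0;
        while (true) {
            int sep = input.indexOf(';', start);
            if (sep < 0) {
                return isValidToken(input.substring(start)); // last token
            }
            if (!isValidToken(input.substring(start, sep))) {
                return false;
            }
            start = sep + 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("12*;34;5*")); // prints true
        System.out.println(isValid("1*2*3*"));    // prints false
    }
}
```

Since nothing here recurses, the input length has no effect on stack usage.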

The thing you may want to do is increase the maximum size of your stack so it doesn't overflow. You can read about how to do that here.

Basically, you start your program with the -Xss option, for example -Xss4m. When I started your code with -Xss4m, your program ran without a stack overflow for me (it returns true).
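If changing the JVM options is not possible, a related technique (not the -Xss flag itself) is to run the match in a thread created with a requested stack size. The sketch below uses the four-argument Thread constructor; the class and method names are hypothetical, and the JVM treats the stack size as a hint that some platforms may ignore.

```java
import java.util.regex.Pattern;

// Hypothetical helper: runs a regex match on a thread with a requested stack size.
class BigStackRegex {
    static boolean matchesOnBigStack(String regex, String input, long stackBytes) {
        boolean[] result = new boolean[1];
        Throwable[] failure = new Throwable[1];

        // Thread(ThreadGroup, Runnable, String, long): the last argument is the
        // requested stack size in bytes (a hint, not a guarantee).
        Thread worker = new Thread(null, () -> {
            try {
                result[0] = Pattern.matches(regex, input);
            } catch (StackOverflowError e) {
                failure[0] = e; // the requested stack was still too small
            }
        }, "regex-worker", stackBytes);

        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the match", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("stack overflow despite larger stack", failure[0]);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // 4 MB mirrors the -Xss4m value mentioned above.
        System.out.println(matchesOnBigStack("(\\d\\*?(;(?=\\d))?)+", "1*;2;3", 4L * 1024 * 1024));
    }
}
```

Either way, a bigger stack only raises the limit; a sufficiently long input can still overflow it.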
