StackOverflowError when matching large input using RegEx

Posted by 夙愿已清 on 2019-11-28 05:24:40

Question


I get a StackOverflowError when matching input against a RegEx pattern.

The pattern is (\d\*?(;(?=\d))?)+. This regex is used to validate the input:

12345;4342;234*;123*;344324

The input is a string consisting of values (digits only) separated by ;. Each value may include one * at the end (used as a wildcard for other matching). There is no ; at the end of the string.

The problem is that this regex works fine with a small number of values, but when the number of values is too large (over 300), it causes a StackOverflowError.

final String TEST_REGEX = "(\\d\\*?(;(?=\\d))?)+";

// Generate string
StringBuilder builder = new StringBuilder();
int number = 123456;
for (int count = 1; count <= 300; count++) {
    builder.append(Integer.toString(number).concat(";"));
    number++;
}
builder.deleteCharAt(builder.lastIndexOf(";")); // drop the trailing ';'

builder.toString().matches(TEST_REGEX); //<---------- StackOverflowError

And the stacktrace:

java.lang.StackOverflowError
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
    at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
    ...

I think the lookahead in the pattern causes this error, since there are a lot of lookups, but I haven't figured out how to reduce them or work around the problem.

I would really appreciate any suggestions, since I'm not experienced with RegEx.


Answer 1:


Before fixing the problem with StackOverflowError...

  1. I would like to point out that your current regex (\d\*?(;(?=\d))?)+ fails to validate this condition.

    Each value could include one * at the end (used as wildcard for other matching)

    It fails to reject the input 23*4*4*;34*434*34, as seen here [1].

  2. Your regex will do unnecessary backtracking on a non-matching input.

  3. Java uses stack space for each repetition of the group (\d\*?(;(?=\d))?) (which is repeated one or more times by +), so the recursion depth grows with the length of the input.

A correct regex would be:

\d+\*?(?:;\d+\*?)*

Note that this will reject a lone *. Your requirement does not make clear whether that input should be accepted or rejected.
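To make the difference concrete, here is a minimal sketch that runs the original pattern and the corrected pattern against a few short inputs. The class name RegexComparison and the sample strings other than the two taken from this thread are only illustrative assumptions.

```java
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name), not part of the original question.
final class RegexComparison {
    // Pattern from the question.
    static final Pattern ORIGINAL = Pattern.compile("(\\d\\*?(;(?=\\d))?)+");
    // Corrected pattern from this answer.
    static final Pattern CORRECTED = Pattern.compile("\\d+\\*?(?:;\\d+\\*?)*");

    static boolean original(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    static boolean corrected(String s) {
        return CORRECTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "12345;4342;234*;123*;344324", // valid input from the question
            "23*4*4*;34*434*34",           // '*' in the middle of a value
            "*",                           // lone wildcard
            "123;"                         // trailing separator
        };
        for (String s : samples) {
            System.out.println(s + " original=" + original(s) + " corrected=" + corrected(s));
        }
    }
}
```

The inputs are kept short on purpose, so neither pattern should come anywhere near the stack limit here; the only point is that the original pattern accepts 23*4*4*;34*434*34 while the corrected one rejects it.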

This doesn't fix the StackOverflow problem, since each repetition of the group (?:;\d+\*?) is also going to use up stack. To fix that, make all quantifiers possessive, since there is no need for backtracking, as the grammar is not ambiguous:

\d++\*?+(?:;\d++\*?+)*+

As a Java string literal:

"\\d++\\*?+(?:;\\d++\\*?+)*+"

I have tested the regex above against both matching and non-matching input containing more than 3600 tokens (separated by ;).
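A test along those lines could look like the sketch below. It assumes the possessive pattern quoted above; the class name PossessiveValidator, the helper buildInput, and the token count of 3600 are illustrative choices, not the exact test that was run.

```java
import java.util.regex.Pattern;

// Illustrative helper (hypothetical name) around the possessive pattern.
final class PossessiveValidator {
    // Every quantifier is possessive, so the matcher never backtracks into them.
    static final Pattern VALUES = Pattern.compile("\\d++\\*?+(?:;\\d++\\*?+)*+");

    static boolean isValid(String input) {
        return VALUES.matcher(input).matches();
    }

    // Builds "123456;123457;..." with `count` values and no trailing ';'.
    static String buildInput(int count) {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                builder.append(';');
            }
            builder.append(123456 + i);
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String large = buildInput(3600);
        System.out.println(isValid(large));               // large matching input
        System.out.println(isValid(large + ";"));         // trailing ';' makes it invalid
        System.out.println(isValid("23*4*4*;34*434*34")); // '*' inside a value
    }
}
```

With a default stack size this should report the large input as valid and the other two as invalid, without a StackOverflowError.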

Footnote

1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.

Appendix

  • Actually, in my testing with your regex (\d\*?(;(?=\d))?)+ (which is incorrect according to your requirement), making the outermost + possessive (++) seems to fix the StackOverflowError, at least with around 3600 tokens (separated by ;, for a string of around 20k characters). It also doesn't seem to cause a long execution time when tested against a non-matching string.

  • In my solution, making the * quantifier on the group (?:;\d+\*?) possessive is enough to resolve the StackOverflowError.

    "\\d+\\*?(?:;\\d+\\*?)*+"
    

    However, I made every quantifier possessive to be on the safe side.




Answer 2:


Your regexp is a bit inefficient and does not match your description. You have '\d\*?', which is one digit followed by an optional *. Then comes an optional ';(?=\d)', which is a ';' with a lookahead for a digit. The string '1*2*3*' will match your regexp but not your description. You could use the following regexp instead. It matches your input and is a bit more efficient.

final String TEST_REGEX = "(\\d+\\*?)(?:;\\d+\\*?)+";

It will pass the test when count < 300, but it still fails for larger values. To verify such input, use plain string operations like indexOf and substring instead.
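A minimal sketch of that regex-free approach is shown below. The class name PlainValidator and its method names are assumptions made for illustration, and the rules it enforces (non-empty run of ASCII digits, at most one trailing *, no empty values) are read off the question's description.

```java
// Illustrative regex-free validator (hypothetical names), using indexOf and substring.
final class PlainValidator {

    // Splits on ';' by hand and checks each value; uses a loop, so no deep recursion.
    static boolean isValid(String input) {
        int start = 0;
        while (true) {
            int end = input.indexOf(';', start);
            String token = (end == -1) ? input.substring(start) : input.substring(start, end);
            if (!isValidToken(token)) {
                return false;
            }
            if (end == -1) {
                return true; // that was the last value
            }
            start = end + 1;
        }
    }

    // A value is one or more ASCII digits, optionally followed by a single '*'.
    static boolean isValidToken(String token) {
        int length = token.length();
        if (length > 0 && token.charAt(length - 1) == '*') {
            length--; // ignore the trailing wildcard
        }
        if (length == 0) {
            return false; // empty value or lone '*'
        }
        for (int i = 0; i < length; i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```

Because the work is done in plain loops, the stack depth stays constant no matter how many values the input contains. A trailing ; is rejected because it leaves an empty last value.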




Answer 3:


The thing you may want to do is increase the maximum size of your stack so it doesn't overflow. You can read about how to do that here.

Basically, you start your program with the -Xss option, for example -Xss4m. When I started your code with -Xss4m, your program ran without a stack overflow for me (it returns true).
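If changing the JVM launch options is not convenient, a related technique (not the -Xss flag itself) is to run the match on a dedicated thread created with a larger requested stack size. The sketch below assumes this is acceptable for your program; the class name BigStackMatch, the thread name, and the 64 MB figure are arbitrary illustrative choices. Note that the Thread documentation describes the stackSize argument as a suggestion that some platforms may ignore.

```java
// Illustrative sketch (hypothetical names): run the original regex on a thread
// that requests a larger stack, instead of passing -Xss to the whole JVM.
final class BigStackMatch {
    // Pattern from the question, unchanged.
    static final String TEST_REGEX = "(\\d\\*?(;(?=\\d))?)+";

    static boolean matchesWithStack(String input, long stackBytes) {
        boolean[] result = new boolean[1];
        Throwable[] failure = new Throwable[1];
        // Thread(ThreadGroup, Runnable, String, long): the last argument is the requested stack size.
        Thread worker = new Thread(null, () -> {
            try {
                result[0] = input.matches(TEST_REGEX);
            } catch (StackOverflowError e) {
                failure[0] = e; // the requested stack was still too small, or was ignored
            }
        }, "regex-worker", stackBytes);
        worker.start();
        try {
            worker.join(); // join() makes the worker's writes visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the match", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("stack overflow despite larger stack request", failure[0]);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(matchesWithStack("12345;4342;234*;123*;344324", 64L * 1024 * 1024));
    }
}
```

Like -Xss, this only raises the limit; a long enough input can still overflow, so it is a workaround rather than a fix for the pattern.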



Source: https://stackoverflow.com/questions/15082010/stackoverflowerror-when-matching-large-input-using-regex
