Question
I got a StackOverflowError when matching input against a regex pattern. The pattern is (\d\*?(;(?=\d))?)+. This regex is used to validate input such as:
12345;4342;234*;123*;344324
The input is a string consisting of values (digits only) separated by ;. Each value may include one * at the end (used as a wildcard for other matching). There is no ; at the end of the string.
The problem is that this regex works fine with a small number of values, but when the number of values is too large (over 300), it causes a StackOverflowError.
final String TEST_REGEX = "(\\d\\*?(;(?=\\d))?)+";
// Generate a string of 300 values, each followed by ';'
StringBuilder builder = new StringBuilder();
int number = 123456;
for (int count = 1; count <= 300; count++) {
    builder.append(number).append(';');
    number++;
}
// Remove the trailing ';'
builder.deleteCharAt(builder.lastIndexOf(";"));
builder.toString().matches(TEST_REGEX); // <---------- StackOverflowError
And the stacktrace:
java.lang.StackOverflowError
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$Ques.match(Pattern.java:4079)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3715)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4556)
at java.util.regex.Pattern$Loop.match(Pattern.java:4683)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4615)
...
I think the lookahead in the pattern causes this error, since it is evaluated so many times, but I haven't figured out how to reduce that or work around it.
I would really appreciate any suggestions, since I'm not experienced with regex.
Answer 1:
Before fixing the problem with StackOverflowError, I would like to point out that your current regex (\d\*?(;(?=\d))?)+ fails to validate this condition:
"Each value could include one * at the end (used as wildcard for other matching)"
It fails to reject the case 23*4*4*;34*434*34, as seen on regex101 (see footnote 1).
Your regex will also do unnecessary backtracking on a non-matching input.
Java uses up stack for each repetition of the group (\d\*?(;(?=\d))?) (which is repeated one or more times by the + quantifier).
A correct regex would be:
\d+\*?(?:;\d+\*?)*
Note that this will reject a lone *; it is not clear from your requirements whether you want to accept or reject it.
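To make the difference concrete, here is a small sketch that runs the pattern from the question and the corrected pattern against a few short inputs. The class name RegexComparison and the helper matches are made up for illustration; only the two patterns come from this thread. The inputs are short, so stack depth is not an issue here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is not from the original post.
final class RegexComparison {
    // The pattern from the question.
    static final Pattern ORIGINAL = Pattern.compile("(\\d\\*?(;(?=\\d))?)+");
    // The corrected (not yet possessive) pattern from this answer.
    static final Pattern CORRECTED = Pattern.compile("\\d+\\*?(?:;\\d+\\*?)*");

    // True only if the whole input matches the pattern.
    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "12345;4342;234*;123*;344324", // valid input from the question
            "23*4*4*;34*434*34",           // '*' in the middle of a value
            "*",                           // lone wildcard
            "123;"                         // trailing separator
        };
        for (String s : samples) {
            System.out.printf("%-30s original=%b corrected=%b%n",
                    s, matches(ORIGINAL, s), matches(CORRECTED, s));
        }
    }
}
```

The second sample is the interesting one: the original pattern accepts it, while the corrected pattern rejects it.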
This doesn't fix the StackOverflowError problem, since each repetition of the group (?:;\d+\*?) is also going to use up stack. To fix that, make all quantifiers possessive; there is no need for backtracking, as the grammar is not ambiguous:
\d++\*?+(?:;\d++\*?+)*+
As a Java string literal:
"\\d++\\*?+(?:;\\d++\\*?+)*+"
I have tested the regex above with matching and non-matching input containing more than 3600 tokens (separated by ;).
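A test along those lines could look like the sketch below. The class PossessiveValidator and its buildInput helper are illustrative names, not part of the original answer; the input mimics the generator from the question (consecutive six-digit numbers joined by ;). Whether a given non-possessive pattern overflows depends on the JVM and its stack size, so the sketch exercises only the possessive pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for trying the possessive pattern on large input.
final class PossessiveValidator {
    // Fully possessive pattern from the answer, compiled once for reuse.
    static final Pattern VALUES = Pattern.compile("\\d++\\*?+(?:;\\d++\\*?+)*+");

    // True only if the whole input is a ';'-separated list of valid values.
    static boolean isValid(String input) {
        return VALUES.matcher(input).matches();
    }

    // Builds `count` consecutive numbers joined by ';' with no trailing ';'.
    static String buildInput(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(';');
            }
            sb.append(123456 + i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String large = buildInput(3600);
        System.out.println("valid input:      " + isValid(large));
        System.out.println("trailing ';':     " + isValid(large + ";"));
        System.out.println("'*' inside value: " + isValid(large + ";12*3"));
    }
}
```

Compiling the pattern once with Pattern.compile, rather than calling String.matches each time, avoids recompiling the regex on every validation.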
Footnote
1: regex101 uses PCRE flavor, which is slightly different from Java regex flavor. However, the features used in your regex are common between them, so there should be no discrepancy.
Appendix
Actually, from my testing with your regex (\d\*?(;(?=\d))?)+ (which is incorrect according to your requirement), making the outermost + possessive (++) seems to fix the StackOverflowError problem, at least in my testing with around 3600 tokens (separated by ;, with the string around 20k characters long). It also doesn't seem to cause long execution time when testing against a non-matching string.
In my solution, making the * quantifier for the group (?:;\d+\*?) possessive is enough to resolve the StackOverflowError:
"\\d+\\*?(?:;\\d+\\*?)*+"
However, I made everything possessive to be on the safe side.
Answer 2:
Your regexp is a bit inefficient and does not match your description. You have \d\*?, which is one digit followed by an optional *, and then an optional ;(?=\d), which is a ; with a lookahead for a digit. The string 1*2*3* will match your regexp but not your description. You could use the following regexp instead. It matches your input and is a bit more efficient.
final String TEST_REGEX = "(\\d+\\*?)(?:;\\d+\\*?)+";
It passes the test when count < 300 but still fails for larger values. Use plain string operations such as indexOf and substring to verify the input.
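One way the plain string approach could look is sketched below. This is only an interpretation of the suggestion: the class PlainValidator and its method names are invented here, and the rules it enforces (non-empty values of ASCII digits, an optional single * at the end, no empty value and no trailing ;) are taken from the question's description.

```java
// Hypothetical regex-free validator; names are illustrative only.
final class PlainValidator {
    // Walks the input one ';'-separated value at a time using indexOf/substring.
    static boolean isValid(String input) {
        int start = 0;
        while (true) {
            int sep = input.indexOf(';', start);
            String value = (sep < 0) ? input.substring(start) : input.substring(start, sep);
            if (!isValidValue(value)) {
                return false; // also covers empty values, e.g. a trailing ';'
            }
            if (sep < 0) {
                return true; // last value checked
            }
            start = sep + 1;
        }
    }

    // A value is one or more ASCII digits, optionally followed by one '*'.
    static boolean isValidValue(String value) {
        int digitsEnd = value.endsWith("*") ? value.length() - 1 : value.length();
        if (digitsEnd == 0) {
            return false; // empty value or a lone '*'
        }
        for (int i = 0; i < digitsEnd; i++) {
            char c = value.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValid("12345;4342;234*;123*;344324"));
        System.out.println(isValid("23*4*4*;34*434*34"));
    }
}
```

Because this is a simple loop with no recursion, its stack usage does not grow with the number of values.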
Answer 3:
The thing you may want to do is increase the maximum size of your stack so it doesn't overflow. You can read about how to do that here.
Basically, you start your program with the -Xss option, for example -Xss4m. When I started your code with -Xss4m, your program ran without a stack overflow for me (it returns true).
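If changing the JVM launch flags is not convenient, a related technique (not what this answer describes, which is the -Xss flag) is to run only the match on a thread created with a larger requested stack size. The sketch below uses the standard Thread constructor that takes a stackSize argument; per its Javadoc the value is only a hint that some platforms may ignore, so treat this as an assumption to verify on your JVM. The class name BigStackMatcher and the 64 MB figure are arbitrary choices for illustration.

```java
// Hypothetical helper; runs one regex match on a thread with a larger stack.
final class BigStackMatcher {
    // Returns input.matches(regex), evaluated on a worker thread that
    // requests `stackBytes` of stack (a hint the JVM may ignore).
    static boolean matchesWithStack(String regex, String input, long stackBytes) {
        final boolean[] result = new boolean[1];
        final Throwable[] failure = new Throwable[1];
        Thread worker = new Thread(null, () -> {
            try {
                result[0] = input.matches(regex);
            } catch (Throwable t) {
                failure[0] = t; // includes StackOverflowError if the stack is still too small
            }
        }, "regex-matcher", stackBytes);
        worker.start();
        try {
            worker.join(); // join() makes the worker's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while matching", e);
        }
        if (failure[0] != null) {
            throw new IllegalStateException("matching failed", failure[0]);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String regex = "(\\d\\*?(;(?=\\d))?)+"; // pattern from the question
        System.out.println(matchesWithStack(regex, "12345;4342;234*;123*;344324", 64L * 1024 * 1024));
    }
}
```

Like -Xss, this only raises the limit; a long enough input can still overflow, so it does not replace fixing the pattern.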
Source: https://stackoverflow.com/questions/15082010/stackoverflowerror-when-matching-large-input-using-regex