I have an extremely long string that I want to parse for a numeric value that occurs after the substring \"ISBN\". However, this grouping of 13 digits can be arranged differ
If you're going to be calling the method a lot, the best thing you can do is not compile the Pattern inside it. Otherwise, each time you call the method you'll spend more time creating the regex than you will actually searching for it.
But after looking at your code again, I think you have a bigger problem, performance-wise. All that business of locating "ISBN" and then creating substrings to apply the regex to is completely unnecessary. Let the regex do that stuff; it's what they're for. The following regex finds the "ISBN" sentinel and the following thirteen digits, if they're there:
static final Pattern isbnPattern = Pattern.compile(
"\\bISBN[^A-Z0-9]*+(\\d(?:-*+\\d){12})", Pattern.CASE_INSENSITIVE );
The [^A-Z0-9]*+
gobbles up whatever characters may appear between the "ISBN" and the first digit. The possessive quantifier (*+
) prevents needless backtracking; if the next character is not a digit, the regex engine immediately quits that match attempt and resumes scanning for another "ISBN" instance.
I used another possessive quantifier for the optional hyphens, plus a non-capturing group ((?:...)
) for the repeated portion; that gives another slight performance gain over the capturing groups most of the other responders are using. But I used a capturing group for the whole number, so it can be extracted from the overall match easily. With these changes, your method reduces to this:
public String parseISBN (String source) {
Matcher m = isbnPattern.matcher(source);
return m.find() ? m.group(1) : null;
}
...and it's much more efficient, too. Note that we haven't addressed how the strings are getting into memory. If you're doing the I/O yourself, it's possible there are significant performance gains to be achieved in that area, too.