This is likely an implementation detail, but for the Oracle and IBM JDKs at least: is the compiled regex Pattern cached, or do we as application developers need to do the caching ourselves?
According to Joshua Bloch's Effective Java:
Some object creations are much more expensive than others. If you’re going to need such an “expensive object” repeatedly, it may be advisable to cache it for reuse. Unfortunately, it’s not always obvious when you’re creating such an object. Suppose you want to write a method to determine whether a string is a valid Roman numeral. Here’s the easiest way to do this using a regular expression:
// Performance can be greatly improved!
static boolean isRomanNumeral(String s) {
    return s.matches("^(?=.)M*(C[MD]|D?C{0,3})"
            + "(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$");
}
The problem with this implementation is that it relies on the String.matches method. While String.matches is the easiest way to check if a string matches a regular expression, it’s not suitable for repeated use in performance-critical situations. The problem is that it internally creates a Pattern instance for the regular expression and uses it only once, after which it becomes eligible for garbage collection. Creating a Pattern instance is expensive because it requires compiling the regular expression into a finite state machine. To improve the performance, explicitly compile the regular expression into a Pattern instance (which is immutable) as part of class initialization, cache it, and reuse the same instance for every invocation of the isRomanNumeral method:
// Reusing expensive object for improved performance
public class RomanNumerals {
    private static final Pattern ROMAN = Pattern.compile(
            "^(?=.)M*(C[MD]|D?C{0,3})"
            + "(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$");

    static boolean isRomanNumeral(String s) {
        return ROMAN.matcher(s).matches();
    }
}
The improved version of isRomanNumeral provides significant performance gains if invoked frequently. On my machine, the original version takes 1.1 μs on an 8-character input string, while the improved version takes 0.17 μs, which is 6.5 times faster.
I've created a class, CachedPattern, that caches Pattern objects. If you run the main method, you'll see that the objects returned by Java's Pattern.compile are in fact different instances, each of which also consumes memory.
import java.util.HashMap;
import java.util.regex.Pattern;
import org.eclipse.core.runtime.Assert;

public class CachedPattern {
    public static void main(String[] args) {
        Pattern p1 = Pattern.compile("abc");
        Pattern p2 = Pattern.compile("abc");
        Pattern p3 = Pattern.compile("abc");
        Pattern p4 = Pattern.compile("abc");
        Pattern p5 = Pattern.compile("abc");
        Pattern x1 = CachedPattern.compile("abc");
        Pattern x2 = CachedPattern.compile("abc");
        Pattern x3 = CachedPattern.compile("abc");
        Pattern x4 = CachedPattern.compile("abc");
        Pattern x5 = CachedPattern.compile("abc");
        // Pattern does not override equals(), so equals() compares identity.
        // are cached objects the same? YES!
        Assert.isTrue(x1.equals(x2));
        Assert.isTrue(x1.equals(x3));
        Assert.isTrue(x1.equals(x4));
        Assert.isTrue(x1.equals(x5));
        // are non-cached objects the same? NO!
        Assert.isTrue(p1.equals(p2)); // throws AssertionFailedException
    }

    // Not thread-safe: a plain HashMap needs external synchronization
    // if compile() can be called from several threads.
    private static final HashMap<String, Pattern> cached = new HashMap<>();

    /**
     * Separator placed between the regex and its flags when building a cache key.
     * It must be a string the user is unlikely to put inside the "regex" argument,
     * otherwise a pattern with flags and one without could share a key.
     * For example, if UNIQUE_HASH were empty:
     * compile(pattern = "abc1")
     * VS.
     * compile(pattern = "abc", flag = 1)
     * would both produce the key "abc1".
     */
    private static final String UNIQUE_HASH = "(())[]+@#$%^@!@#$%*";

    public static Pattern compile(String regex) {
        if (cached.containsKey(regex)) {
            return cached.get(regex);
        }
        Pattern p = Pattern.compile(regex);
        cached.put(regex, p);
        return p;
    }

    public static Pattern compile(String regex, int flags) {
        String uniqueKey = regex + UNIQUE_HASH + flags;
        if (cached.containsKey(uniqueKey)) {
            return cached.get(uniqueKey);
        }
        // Pass the flags through; compiling without them would cache the wrong pattern.
        Pattern p = Pattern.compile(regex, flags);
        cached.put(uniqueKey, p);
        return p;
    }
}
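For comparison, here is a minimal sketch of the same idea that avoids the string-concatenation key and is safe to call from several threads. PatternCache and its nested Key class are hypothetical names invented for this example, not JDK classes, and the cache is unbounded, so it only makes sense when the set of distinct regexes is small and known.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; nothing like this ships with the JDK.
final class PatternCache {

    // Cache key made of the regex and its flags, so "abc1" with no flags
    // and "abc" with flag 1 can never collide.
    private static final class Key {
        final String regex;
        final int flags;

        Key(String regex, int flags) {
            if (regex == null) {
                throw new NullPointerException("regex");
            }
            this.regex = regex;
            this.flags = flags;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            return flags == other.flags && regex.equals(other.regex);
        }

        @Override
        public int hashCode() {
            return 31 * regex.hashCode() + flags;
        }
    }

    // Unbounded: entries are never evicted.
    private static final ConcurrentMap<Key, Pattern> CACHE = new ConcurrentHashMap<>();

    private PatternCache() {
    }

    static Pattern compile(String regex) {
        return compile(regex, 0); // 0 means "no flags", as in Pattern.compile(regex)
    }

    static Pattern compile(String regex, int flags) {
        // Compiles only when the key is absent; concurrent callers get the same instance.
        return CACHE.computeIfAbsent(new Key(regex, flags),
                k -> Pattern.compile(k.regex, k.flags));
    }
}
```

With this sketch, PatternCache.compile("abc") returns the same instance on every call, while PatternCache.compile("abc", Pattern.CASE_INSENSITIVE) gets its own entry.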
I don't believe the results are cached and there's no evidence of such behaviour in the code or the documentation. It would (of course) be relatively trivial to implement such a cache yourself, but I would be interested in a use case in which such caching is beneficial.
Regarding the comment below about String.split(): it takes a different approach, in that the code follows a separate path for trivial one- or two-character patterns than for more complex regexps. But it still doesn't appear to cache.
As far as I can tell from looking at the code (JDK 6), it doesn't do any caching, but once constructed, a Pattern object can be cached on the application side and shared among multiple threads. The standard idiom seems to be to assign it to a static final variable:
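To illustrate the String.split() point, here is a small sketch contrasting a call that hands split() a non-trivial regex each time with one that reuses a precompiled Pattern. SplitExample and its method names are made up for this example. The assumption, based on the OpenJDK sources I'm aware of, is that a regex like this one does not qualify for the trivial-pattern path and is therefore compiled on every call.

```java
import java.util.regex.Pattern;

// Hypothetical example class, for illustration only.
final class SplitExample {

    // Compiled once when the class is initialized and reused afterwards.
    private static final Pattern SEPARATOR = Pattern.compile("\\s*,\\s*");

    private SplitExample() {
    }

    // Passes the regex as a string, so it is compiled again on each call
    // (assuming it is not handled by the trivial-pattern path).
    static String[] splitEachTime(String line) {
        return line.split("\\s*,\\s*");
    }

    // Reuses the precompiled Pattern; the result is the same.
    static String[] splitPrecompiled(String line) {
        return SEPARATOR.split(line);
    }
}
```

Both methods return the same array for the same input; only the amount of compilation work per call differs.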
private static final Pattern p = Pattern.compile(",");
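A short sketch of how such a shared constant might be used from several threads. The Pattern javadoc states that Pattern instances are immutable and safe for concurrent use, while Matcher instances are not, so each call below creates its own Matcher. SharedPatternExample, countCommas and countAll are names invented for this example.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical example class, for illustration only.
final class SharedPatternExample {

    // One immutable Pattern shared by all threads.
    private static final Pattern COMMA = Pattern.compile(",");

    private SharedPatternExample() {
    }

    static int countCommas(String s) {
        // A Matcher holds matching state, so create a fresh one per call
        // instead of sharing it between threads.
        Matcher m = COMMA.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    static List<Integer> countAll(List<String> lines) {
        // The parallel stream may run countCommas on several threads at once;
        // that is fine because only the Pattern is shared.
        return lines.parallelStream()
                .map(SharedPatternExample::countCommas)
                .collect(Collectors.toList());
    }
}
```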
It doesn't. If you have performance-sensitive areas, you might want to hold your Pattern objects as member variables.
Clojure does this more or less automatically when you have a regex in a function though.