Say I'm running a service where users can submit a regex to search through lots of data. If the user submits a regex that is very slow (i.e. takes minutes for Matcher.find()
Another workaround would be to limit the region of the matcher, then call find(), repeating until the thread is interrupted or a match is found.
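A minimal sketch of that workaround, assuming a hypothetical helper findInWindows and a caller-chosen windowSize (neither is from the original post). Note the trade-offs: a match that straddles two windows is missed, anchors and lookarounds treat the window edges as input boundaries (the Matcher defaults), and a single find() call on one window can still backtrack for a long time, so this only bounds how often the interrupt flag gets checked.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionLimitedFinder {

    /**
     * Searches the input one window at a time and checks the interrupt flag
     * between windows. Returns the start index of the first match found,
     * or -1 if no window contains a match.
     */
    static int findInWindows(Pattern pattern, CharSequence input, int windowSize)
            throws InterruptedException {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        Matcher matcher = pattern.matcher(input);
        int length = input.length();
        int start = 0;
        do {
            // Thread.interrupted() clears the flag, so report it via the exception.
            if (Thread.interrupted()) {
                throw new InterruptedException("regex search cancelled");
            }
            // Use long arithmetic so a huge windowSize cannot overflow.
            int end = (int) Math.min((long) start + windowSize, length);
            matcher.region(start, end); // resets the matcher and restricts the search
            if (matcher.find()) {
                return matcher.start();
            }
            start = end;
        } while (start < length);
        return -1;
    }
}
```

To cancel the search, call interrupt() on the searching thread from another thread; the next window boundary then throws InterruptedException.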
Maybe what you need is a library that implements the NFA algorithm.
On pathological patterns, the NFA algorithm can be hundreds of times faster than the backtracking algorithm used by the Java standard library.
The Java standard library is also sensitive to the input regexp, which is what causes your problem: some inputs can keep the CPU busy for years.
With the NFA algorithm, the timeout can be enforced by limiting the number of steps it takes, which is more effective than the thread solution. Trust me, I used a thread timeout for a related problem and it was horrible for performance. I finally fixed it by modifying the main loop of my algorithm implementation: I inserted checkpoints into the main loop to test the elapsed time.
The details can be found here: https://swtch.com/~rsc/regexp/regexp1.html .
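The checkpoint idea can be approximated without replacing the regex engine: instead of testing the clock, count how many characters the matcher reads and abort once a budget is used up. The following is only a sketch under that assumption. It still runs on java.util.regex (it is not an NFA engine), and the names StepLimitedCharSequence, StepBudgetExceededException and matchesWithinBudget are made up for illustration.

```java
import java.util.regex.Pattern;

class StepBudgetDemo {

    /** Hypothetical exception: thrown once the matcher has used up its read budget. */
    static class StepBudgetExceededException extends RuntimeException {
        StepBudgetExceededException(long maxSteps) {
            super("Regex aborted after more than " + maxSteps + " character reads");
        }
    }

    /** Wraps the input and counts every charAt call made by the matcher. */
    static class StepLimitedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long maxSteps;
        private final long[] steps; // one-element counter, shared with subsequences

        StepLimitedCharSequence(CharSequence inner, long maxSteps) {
            this(inner, maxSteps, new long[1]);
        }

        private StepLimitedCharSequence(CharSequence inner, long maxSteps, long[] steps) {
            this.inner = inner;
            this.maxSteps = maxSteps;
            this.steps = steps;
        }

        @Override
        public char charAt(int index) {
            // The "checkpoint": one cheap comparison per character read.
            if (++steps[0] > maxSteps) {
                throw new StepBudgetExceededException(maxSteps);
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Share the counter so the budget covers the whole matching operation.
            return new StepLimitedCharSequence(inner.subSequence(start, end), maxSteps, steps);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns whether the whole input matches, or throws if the budget is exceeded. */
    static boolean matchesWithinBudget(String regex, CharSequence input, long maxSteps) {
        return Pattern.compile(regex)
                .matcher(new StepLimitedCharSequence(input, maxSteps))
                .matches();
    }
}
```

A step budget makes the cut-off deterministic for a given regex and input, whereas a wall-clock limit depends on how loaded the machine is. The counter is not thread-safe, so each wrapper should be used by one matching thread only.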
With a little variation it is possible to avoid using additional threads for this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegularExpressionUtils {
// demonstrates behavior for regular expression running into catastrophic backtracking for given input
public static void main(String[] args) {
Matcher matcher = createMatcherWithTimeout(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 2000);
System.out.println(matcher.matches());
}
public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, int timeoutMillis) {
Pattern pattern = Pattern.compile(regularExpression);
return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern, int timeoutMillis) {
CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
regularExpressionPattern.pattern());
return regularExpressionPattern.matcher(charSequence);
}
private static class TimeoutRegexCharSequence implements CharSequence {
private final CharSequence inner;
private final int timeoutMillis;
private final long timeoutTime;
private final String stringToMatch;
private final String regularExpression;
public TimeoutRegexCharSequence(CharSequence inner, int timeoutMillis, String stringToMatch, String regularExpression) {
super();
this.inner = inner;
this.timeoutMillis = timeoutMillis;
this.stringToMatch = stringToMatch;
this.regularExpression = regularExpression;
timeoutTime = System.currentTimeMillis() + timeoutMillis;
}
public char charAt(int index) {
if (System.currentTimeMillis() > timeoutTime) {
throw new RuntimeException("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
+ regularExpression + "' on input '" + stringToMatch + "'!");
}
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch, regularExpression);
}
@Override
public String toString() {
return inner.toString();
}
}
}
Thanks a lot to dawce for pointing me to this solution in answer to an unnecessarily complicated question!
I included a counter so that the timeout is only checked every n calls to charAt, in order to reduce the overhead.
Notes:
Some people stated that charAt may not be called frequently enough. I added the foo variable just to demonstrate how often charAt is called, and that it is frequent enough. If you're going to use this in production, remove that counter, as it will decrease performance and eventually overflow the long if it runs on a server for a long time. In this example, charAt is called 30 million times every 0.8 secs or so (not tested under proper microbenchmarking conditions, it is just a proof of concept). You can set a lower checkInterval if you want higher precision, at the cost of performance (System.currentTimeMillis() > timeoutTime is more expensive than the if clause in the long run).
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.goikosoft.test.RegexpTimeoutException;
/**
* Allows creating regular expression matchers with a timeout.
*
* Limitations: Can only throw RuntimeException. Decreases performance.
*
* Posted by Kris on Stack Overflow.
*
* Modified by dgoiko to execute the timeout check only every n chars.
* Now timeout < 0 means no timeout.
*
* @author Kris https://stackoverflow.com/a/910798/9465588
*
*/
public class RegularExpressionUtils {
public static long foo = 0;
// demonstrates behavior for regular expression running into catastrophic backtracking for given input
public static void main(String[] args) {
long millis = System.currentTimeMillis();
// This checkInterval produces a < 500 ms delay. Higher checkInterval will produce higher delays on timeout.
Matcher matcher = createMatcherWithTimeout(
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "(x+x+)+y", 10000, 30000000);
try {
System.out.println(matcher.matches());
} catch (RuntimeException e) {
System.out.println("Operation timed out after " + (System.currentTimeMillis() - millis) + " milliseconds");
}
System.out.print(foo);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, String regularExpression, long timeoutMillis,
int checkInterval) {
Pattern pattern = Pattern.compile(regularExpression);
return createMatcherWithTimeout(stringToMatch, pattern, timeoutMillis, checkInterval);
}
public static Matcher createMatcherWithTimeout(String stringToMatch, Pattern regularExpressionPattern,
long timeoutMillis, int checkInterval) {
if (timeoutMillis < 0) {
return regularExpressionPattern.matcher(stringToMatch);
}
CharSequence charSequence = new TimeoutRegexCharSequence(stringToMatch, timeoutMillis, stringToMatch,
regularExpressionPattern.pattern(), checkInterval);
return regularExpressionPattern.matcher(charSequence);
}
private static class TimeoutRegexCharSequence implements CharSequence {
private final CharSequence inner;
private final long timeoutMillis;
private final long timeoutTime;
private final String stringToMatch;
private final String regularExpression;
private int checkInterval;
private int attemps;
TimeoutRegexCharSequence(CharSequence inner, long timeoutMillis, String stringToMatch,
String regularExpression, int checkInterval) {
super();
this.inner = inner;
this.timeoutMillis = timeoutMillis;
this.stringToMatch = stringToMatch;
this.regularExpression = regularExpression;
timeoutTime = System.currentTimeMillis() + timeoutMillis;
this.checkInterval = checkInterval;
this.attemps = 0;
}
public char charAt(int index) {
if (this.attemps == this.checkInterval) {
foo++;
if (System.currentTimeMillis() > timeoutTime) {
throw new RegexpTimeoutException(regularExpression, stringToMatch, timeoutMillis);
}
this.attemps = 0;
} else {
this.attemps++;
}
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new TimeoutRegexCharSequence(inner.subSequence(start, end), timeoutMillis, stringToMatch,
regularExpression, checkInterval);
}
@Override
public String toString() {
return inner.toString();
}
}
}
And the custom exception, so you can catch only THAT exception and avoid swallowing other exceptions that Pattern / Matcher may throw.
public class RegexpTimeoutException extends RuntimeException {
private static final long serialVersionUID = 6437153127902393756L;
private final String regularExpression;
private final String stringToMatch;
private final long timeoutMillis;
public RegexpTimeoutException() {
super();
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String message, Throwable cause) {
super(message, cause);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String message) {
super(message);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(Throwable cause) {
super(cause);
regularExpression = null;
stringToMatch = null;
timeoutMillis = 0;
}
public RegexpTimeoutException(String regularExpression, String stringToMatch, long timeoutMillis) {
super("Timeout occurred after " + timeoutMillis + "ms while processing regular expression '"
+ regularExpression + "' on input '" + stringToMatch + "'!");
this.regularExpression = regularExpression;
this.stringToMatch = stringToMatch;
this.timeoutMillis = timeoutMillis;
}
public String getRegularExpression() {
return regularExpression;
}
public String getStringToMatch() {
return stringToMatch;
}
public long getTimeoutMillis() {
return timeoutMillis;
}
}
Based on Andreas' answer. Main credit should go to him and his source.
A long-running pattern matching process can be stopped using the method below.
Create a StateFulCharSequence class which manages the state of pattern matching. When that state is changed, an exception is thrown on the next call to the charAt method.
Schedule that state change using a ScheduledExecutorService with the required timeout.
Here pattern matching happens in the main thread, and there is no need to check the thread interrupt state every time.
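import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TimedPatternMatcher {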
public static void main(String[] args) {
ScheduledExecutorService executorService = Executors.newScheduledThreadPool(1);
Pattern pattern = Pattern.compile("some regex pattern");
StateFulCharSequence stateFulCharSequence = new StateFulCharSequence("some character sequence");
Matcher matcher = pattern.matcher(stateFulCharSequence);
executorService.schedule(stateFulCharSequence, 10, TimeUnit.MILLISECONDS);
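try {
boolean isMatched = matcher.find();
} catch (Exception e) {
e.printStackTrace();
} finally {
// Stop the scheduler thread; otherwise its non-daemon thread keeps the JVM alive.
executorService.shutdownNow();
}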
}
/*
When this runnable is executed, it sets isTimedOut to true, and pattern matching is stopped by the exception thrown from the next charAt call.
*/
public static class StateFulCharSequence implements CharSequence, Runnable{
private CharSequence inner;
// volatile: written by the scheduler thread, read by the matching thread
private volatile boolean isTimedOut = false;
public StateFulCharSequence(CharSequence inner) {
super();
this.inner = inner;
}
public char charAt(int index) {
if (isTimedOut) {
throw new RuntimeException(new TimeoutException("Pattern matching timeout occurs"));
}
return inner.charAt(index);
}
@Override
public int length() {
return inner.length();
}
@Override
public CharSequence subSequence(int start, int end) {
return new StateFulCharSequence(inner.subSequence(start, end));
}
@Override
public String toString() {
return inner.toString();
}
public void setTimedOut() {
this.isTimedOut = true;
}
@Override
public void run() {
this.isTimedOut = true;
}
}
}
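The same idea can be packaged as a reusable helper that also cancels the scheduled task when matching finishes early. This is a hedged sketch, not the answer's code: FlagTimeoutDemo, FlaggedCharSequence, RegexTimedOutException and findWithTimeout are hypothetical names, and the scheduler is passed in on the assumption that the caller owns it and shuts it down.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.regex.Pattern;

class FlagTimeoutDemo {

    /** Hypothetical unchecked exception used to abort the matcher. */
    static class RegexTimedOutException extends RuntimeException {
        RegexTimedOutException(long timeoutMillis) {
            super("Pattern matching timed out after " + timeoutMillis + " ms");
        }
    }

    /** CharSequence that aborts on the next charAt call once the shared flag is set. */
    static class FlaggedCharSequence implements CharSequence {
        private final CharSequence inner;
        private final AtomicBoolean timedOut; // shared with subsequences
        private final long timeoutMillis;

        FlaggedCharSequence(CharSequence inner, AtomicBoolean timedOut, long timeoutMillis) {
            this.inner = inner;
            this.timedOut = timedOut;
            this.timeoutMillis = timeoutMillis;
        }

        @Override
        public char charAt(int index) {
            if (timedOut.get()) {
                throw new RegexTimedOutException(timeoutMillis);
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new FlaggedCharSequence(inner.subSequence(start, end), timedOut, timeoutMillis);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() on the calling thread; the scheduler only flips the flag. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis,
                                   ScheduledExecutorService scheduler) {
        AtomicBoolean timedOut = new AtomicBoolean(false);
        ScheduledFuture<?> trigger =
                scheduler.schedule(() -> timedOut.set(true), timeoutMillis, TimeUnit.MILLISECONDS);
        try {
            return pattern.matcher(new FlaggedCharSequence(input, timedOut, timeoutMillis)).find();
        } finally {
            // Matching is over (match, no match, or timeout): drop the pending task.
            trigger.cancel(false);
        }
    }
}
```

Using one flag per call means a single shared scheduler can serve many concurrent matches without one timeout affecting another.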
From Heritrix: (crawler.archive.org)
/**
* CharSequence that notices thread interrupts -- as might be necessary
* to recover from a loose regex on unexpected challenging input.
*
* @author gojomo
*/
public class InterruptibleCharSequence implements CharSequence {
CharSequence inner;
// public long counter = 0;
public InterruptibleCharSequence(CharSequence inner) {
super();
this.inner = inner;
}
public char charAt(int index) {
if (Thread.interrupted()) { // clears flag if set
throw new RuntimeException(new InterruptedException());
}
// counter++;
return inner.charAt(index);
}
public int length() {
return inner.length();
}
public CharSequence subSequence(int start, int end) {
return new InterruptibleCharSequence(inner.subSequence(start, end));
}
@Override
public String toString() {
return inner.toString();
}
}
Wrap your CharSequence with this one and Thread interrupts will work ...
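One way to put the wrapper to use is to run the match on a worker thread and interrupt it when a deadline passes. The sketch below is an assumption about how that could be wired up, not code from Heritrix: InterruptibleRegexDemo and findWithTimeout are hypothetical names, and a condensed copy of the wrapper is included only so the example compiles on its own.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.regex.Pattern;

class InterruptibleRegexDemo {

    /** Condensed copy of the wrapper shown above. */
    static class InterruptibleCharSequence implements CharSequence {
        private final CharSequence inner;

        InterruptibleCharSequence(CharSequence inner) {
            this.inner = inner;
        }

        @Override
        public char charAt(int index) {
            if (Thread.interrupted()) { // clears the flag if set
                throw new RuntimeException(new InterruptedException());
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new InterruptibleCharSequence(inner.subSequence(start, end));
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() on a worker thread and interrupts it if the timeout elapses. */
    static boolean findWithTimeout(Pattern pattern, CharSequence input, long timeoutMillis)
            throws TimeoutException, InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Callable<Boolean> task =
                () -> pattern.matcher(new InterruptibleCharSequence(input)).find();
        Future<Boolean> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; its next charAt call throws and ends the match.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Regex evaluation failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Creating an executor per call keeps the sketch short; a service handling many user-submitted patterns would more likely reuse a shared pool.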