Question
I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written to provide a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)
Unfortunately, the textile regex expressions (while they work in the context of preg_replace_callback in PHP) fail in Java with the following exception:
java.util.regex.PatternSyntaxException: Unclosed character class near index 217
The statement is obvious, the solution is elusive.
Here's the raw, multiline regex from the PHP implementation:
return preg_replace_callback('/
(^|(?<=[\s>.\(])|[{[]) # $pre
" # start
(' . $this->c . ') # $atts
([^"]+?) # $text
(?:\(([^)]+?)\)(?="))? # $title
":
('.$this->urlch.'+?) # $url
(\/)? # $slash
([^\w\/;]*?) # $post
([\]}]|(?=\s|$|\)))
/x',callback,input);
Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo that resulted in the following, rather long, regular expression:
(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))
I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.
I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.
Any ideas?
I'm also curious why PHP doesn't throw a similar error. For example, I found one "passive subexpression" that RegExr handled poorly, but changing it didn't fix the Java exception and didn't alter behavior in PHP. The change is shown below.
In $title, switch the escaped paren:
(?:\(([^)]+?)\)(?="))? # $title
...^
(?:(\([^)]+?)\)(?="))? # $title
....^
Thanks, Tim
edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...
"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"
Answer 1:
@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ inside a character class opens a nested class, so a literal [ must be escaped. Here the nested class is never closed, which is the syntax error.
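A minimal sketch of that difference, using only java.util.regex. The class name BracketEscapeDemo and the helper compiles are made up for illustration, and the patterns are cut-down fragments rather than the full Textile regex:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BracketEscapeDemo {
    // Hypothetical helper: true if Java accepts the regex, false on a syntax error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The [{[] fragment from $pre: the inner [ opens a nested class that never closes.
        System.out.println(compiles("[{[]"));
        // Escaping the bracket makes it an ordinary member of the class.
        System.out.println(compiles("[{\\[]"));
        System.out.println(Pattern.matches("[{\\[]", "["));
        // A ] placed first (after an optional ^) is still taken literally.
        System.out.println(compiles("[^]]"));
    }
}
```

The first call should report a failure and the other three should succeed, which matches the "Unclosed character class" message in the question.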
Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:
"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"
Whether it's the best regex I have no idea, not knowing how it's being used.
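As a rough check, the regex above can be compiled and tried against a simple Textile-style link. This is only a sketch: the class name TextileLinkDemo, the constant LINK, and the sample input are assumptions for illustration, not part of textile4j.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextileLinkDemo {
    // The Java regex from this answer, copied as-is.
    // Groups: 1 = pre, 2 = atts, 3 = text, 4 = title, 5 = url, 6 = slash, 7 = post.
    static final Pattern LINK = Pattern.compile(
        "(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))");

    public static void main(String[] args) {
        // Hypothetical input: a quoted link text followed by a URL.
        Matcher m = LINK.matcher("\"link text\":http://example.com");
        if (m.find()) {
            System.out.println("text = " + m.group(3));
            System.out.println("url  = " + m.group(5));
        }
    }
}
```

If the pattern still had an unescaped [ in a character class, the static initializer would fail with the PatternSyntaxException from the question instead of reaching main.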
Answer 2:
I'm not sure exactly where your problem lies, but this might help:
In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped.
The revised expression should probably be similar to the following, in order to be Java-compatible:
(^|(?<=[\s>.\(])|[{\[]) # $pre
" # start
(' . $this->c . ') # $atts
([^"]+?) # $text
(?:\(([^)]+?)\)(?="))? # $title
":
('.$this->urlch.'+?) # $url
(\/)? # $slash
([^\w\/;]*?) # $post
([\]}]|(?=\s|$|\)))
/x
Basically, any place where most regex flavors will allow a character class like [a-z,;[\]+-], which would match "either a letter a-z or a comma, semicolon, open or close square bracket, plus or minus sign", Java needs [a-z,;\[\]+-] instead (escape the [ with a \ character).
This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
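For context, here is a small sketch of those constructs, which is why Java treats a bare [ inside a class as the start of a nested class. The class name CharClassOpsDemo and the helper test are hypothetical; the three patterns follow the forms described in the java.util.regex.Pattern documentation.

```java
import java.util.regex.Pattern;

public class CharClassOpsDemo {
    // Hypothetical helper: does the whole input match the given character class?
    static boolean test(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        // Union: a through c, or x through z.
        System.out.println(test("[a-c[x-z]]", "y"));      // true
        System.out.println(test("[a-c[x-z]]", "m"));      // false
        // Intersection: a through z AND one of d, e, f.
        System.out.println(test("[a-z&&[def]]", "d"));    // true
        System.out.println(test("[a-z&&[def]]", "a"));    // false
        // Subtraction: a through z except vowels.
        System.out.println(test("[a-z&&[^aeiou]]", "b")); // true
        System.out.println(test("[a-z&&[^aeiou]]", "e")); // false
    }
}
```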
Source: https://stackoverflow.com/questions/8126339/unclosed-character-class-near-index-nnn