While answering another question, I wrote a regex to match all whitespace up to and including at most one newline. I did this using a negative lookbehind for the \R linebreak matcher: ((?<!\R)\s)*
The construct \R is a macro that wraps its subexpressions in an atomic group, (?> parts ).
That's why it won't break them apart.
A note: if Java accepts fixed-length alternations in a lookbehind, using \R is OK; if the engine doesn't, this would throw an exception.
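To make the atomic-group point concrete, here is a minimal sketch using a toy alternation rather than \R itself. The class name AtomicGroupDemo and the helper fullMatch are made up for illustration; only Pattern.matches comes from the JDK.

```java
import java.util.regex.Pattern;

public class AtomicGroupDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    public static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Plain group: "a" is tried first, "c" then fails against "b",
        // so the engine backtracks into the group and tries "ab".
        System.out.println(fullMatch("(?:a|ab)c", "abc")); // true

        // Atomic group: once "a" has matched, the group is committed
        // and the "ab" alternative is never retried.
        System.out.println(fullMatch("(?>a|ab)c", "abc")); // false
    }
}
```

The same commitment is what keeps a CR LF pair together once the first alternative of the line-break pattern has matched.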
Realization #1. The documentation is wrong
Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
Here it says:
Linebreak matcher
...is equivalent to
\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
However, when we try using the "equivalent" pattern, it returns false:
String _R_ = "\\R";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
// using "equivalent" pattern
_R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false
// now make it atomic, as per sln's answer
_R_ = "(?>"+_R_+")";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
So the Javadoc should really say:
...is equivalent to
(?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])
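The gap between the documented alternation and its atomic form can also be seen without any lookbehind. The following is a rough sketch, not a statement about the JDK internals: the class name and the helper matchesCrLf are hypothetical, and it deliberately sticks to the explicit patterns, making no claim about what \R itself returns on a given JDK.

```java
import java.util.regex.Pattern;

public class LinebreakBacktrackDemo {
    // The alternation quoted from the Javadoc, written as a Java string literal.
    public static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Hypothetical helper: does the given line-break pattern, followed by
    // a literal line feed, match the two-character input CR LF?
    public static boolean matchesCrLf(String linebreak) {
        return Pattern.matches(linebreak + "\\n", "\r\n");
    }

    public static void main(String[] args) {
        // Plain group: CR LF is tried first, the trailing line feed then has
        // nothing left to match, so the engine backtracks to the lone CR.
        System.out.println(matchesCrLf("(?:" + DOCUMENTED + ")")); // true

        // Atomic group: commits to CR LF, so the trailing line feed fails.
        System.out.println(matchesCrLf("(?>" + DOCUMENTED + ")")); // false
    }
}
```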
Update March 9, 2017 per Sherman at Oracle JDK-8176029:
"api doc is NOT wrong, the implementation is wrong (which fails to backtracking "0x0d+next.match()" when "0x0d+0x0a + next.match()" fails)"
Realization #2. Lookbehinds don't only look backwards
Despite the name, a lookbehind is not only able to look backwards, but can include and even jump over the current position.
Consider the following example (from rexegg.com):
"_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_
"This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic."
What this means for our example of \R is that even though our current position may be \n, that will not stop the lookbehind from recognizing that its \r is followed by \n, then binding the two together as an atomic group, and consequently refusing to recognize the \r part behind the current position as a separate match.
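One way to observe this is to list the offsets at which a positive lookbehind for the line-break pattern succeeds in "\r\n". This is only a sketch under stated assumptions: the class name and the helper positionsAfter are invented for the example, and the explicit plain and atomic groups stand in for \R.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindPositionsDemo {
    // The alternation quoted from the Javadoc, written as a Java string literal.
    public static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Hypothetical helper: offsets in the input at which the zero-width
    // lookbehind (?<=linebreak) succeeds.
    public static List<Integer> positionsAfter(String linebreak, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("(?<=" + linebreak + ")").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // Plain group: the lone CR counts as a line break behind offset 1.
        System.out.println(positionsAfter("(?:" + DOCUMENTED + ")", "\r\n")); // [1, 2]

        // Atomic group: starting at the CR, the group grabs CR LF, which
        // ends past offset 1, and it will not fall back to the lone CR.
        System.out.println(positionsAfter("(?>" + DOCUMENTED + ")", "\r\n")); // [2]
    }
}
```

With the atomic form, offset 1 (between the \r and the \n) is not reported, which is why the negative lookbehind in the original regex lets \s go on to consume the \n.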
Note: for simplicity's sake I have used terms such as "our current position is \n"; however, this is not an exact representation of what occurs internally.