Java-8 regex negative lookbehind with `\R`

青春惊慌失措 2020-12-31 23:28

While answering another question, I wrote a regex to match all whitespace up to and including at most one newline. I did this using a negative lookbehind for the \R linebreak matcher: ((?<!\R)\s)*

2 answers
  • 2020-12-31 23:43

    The construct \R is a macro that wraps its subexpressions in an atomic group, (?> parts ).

    That's why it won't break \r\n apart: once the group has matched both characters, it does not backtrack to match \r on its own.

    A note: if Java accepts fixed-length alternations in a lookbehind, using \R there is OK, but if the engine doesn't, compiling the pattern would throw an exception.
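
To illustrate what atomic grouping means in general, here is a minimal sketch that uses a toy pattern rather than \R itself. The class name AtomicGroupDemo, the helper fullMatch, and the sample input are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample input are illustrative only.
class AtomicGroupDemo {
    // Returns true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Ordinary group: "a" matches first, "c" then fails against "b",
        // so the engine backtracks into the group and tries "ab" instead.
        System.out.println(fullMatch("(?:a|ab)c", "abc")); // true

        // Atomic group: once "a" has matched, that choice is final.
        // The alternative "ab" is never tried, so the match fails.
        System.out.println(fullMatch("(?>a|ab)c", "abc")); // false
    }
}
```

The same "first successful alternative is final" behaviour is what keeps an atomic \r\n|[...] group from falling back to \r alone.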

  • 2020-12-31 23:48

    Realization #1. The documentation is wrong

    Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

    Here it says:

    Linebreak matcher

    ...is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

    However, when we try using the "equivalent" pattern, it returns false:

    String _R_ = "\\R";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
    
    // using "equivalent" pattern
    _R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false
    
    // now make it atomic, as per sln's answer
    _R_ = "(?>"+_R_+")";
    System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true
    

    So the Javadoc should really say:

    ...is equivalent to (?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

    Update (March 9, 2017), per Sherman at Oracle in JDK-8176029:

    "api doc is NOT wrong, the implementation is wrong (which fails to backtracking "0x0d+next.match()" when "0x0d+0x0a + next.match()" fails)"

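The backtracking that the quoted explanation refers to can be observed without any lookbehind. In the sketch below, the class name LinebreakBacktrackDemo and the helper linebreakThenLf are invented for illustration. The \R result is printed without an expected value because it depends on whether the running JDK has the implementation behaviour described in the bug report.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class LinebreakBacktrackDemo {
    // The pattern the Javadoc gives as the equivalent of \R.
    static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // True if the input is one linebreak (as defined by `linebreak`)
    // followed by exactly one more \n.
    static boolean linebreakThenLf(String linebreak, String input) {
        return Pattern.matches(linebreak + "\\n", input);
    }

    public static void main(String[] args) {
        String crlf = "\r\n";

        // Non-atomic: \r\n is tried first, the trailing \n then fails, and
        // the engine backtracks to match \r alone followed by \n.
        System.out.println(linebreakThenLf("(?:" + DOCUMENTED + ")", crlf)); // true

        // Atomic: \r\n is consumed and never given back, so nothing is
        // left for the trailing \n.
        System.out.println(linebreakThenLf("(?>" + DOCUMENTED + ")", crlf)); // false

        // \R itself: compare this line with the two above on your JDK.
        System.out.println(linebreakThenLf("\\R", crlf));
    }
}
```

If the last line prints false, \R is behaving like the atomic variant on that runtime; if it prints true, it is behaving like the documented alternation.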

    Realization #2. Lookbehinds don't only look backwards

    Despite the name, a lookbehind can do more than look backwards: what it matches can include, and even jump over, the current position.

    Consider the following example (from rexegg.com):

    "_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_
    

    "This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic."

    What this means for our \R example is that even when the current position is at \n, the lookbehind can still see that its \r is followed by \n. It binds the two together as an atomic group and consequently refuses to treat the \r behind the current position as a separate match.

    Note: for simplicity's sake I have used phrases such as "our current position is \n"; this is not an exact description of what happens internally.
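
The effect described above can be made visible by listing which whitespace characters pass the negative lookbehind. This is a rough sketch that uses the explicit alternation in place of \R so the atomic and non-atomic cases can be compared side by side; the class name LookbehindPositionsDemo and the helper unprecededWhitespace are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class LookbehindPositionsDemo {
    static final String ALTERNATION =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Start indices of every whitespace character that is NOT
    // preceded by a match of `linebreak`.
    static List<Integer> unprecededWhitespace(String linebreak, String input) {
        Matcher m = Pattern.compile("(?<!" + linebreak + ")\\s").matcher(input);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        String plain = "(?:" + ALTERNATION + ")";
        String atomic = "(?>" + ALTERNATION + ")";

        // Non-atomic: at index 1 the lookbehind falls back to \r alone,
        // so the \n counts as "preceded by a linebreak" and is rejected.
        System.out.println(unprecededWhitespace(plain, "\r\n"));  // [0]

        // Atomic: starting at \r, the group grabs \r\n and overshoots
        // index 1, so the lookbehind finds no linebreak ending there.
        System.out.println(unprecededWhitespace(atomic, "\r\n")); // [0, 1]

        // Two separate linebreaks: the second \n really is preceded by one.
        System.out.println(unprecededWhitespace(atomic, "\n\n")); // [0]
    }
}
```

In the atomic case the \n at index 1 is accepted only because the group, while matching from index 0, reads past the current position and commits to \r\n as a single unit.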
