How does {m}{n} (“exactly n times” twice) work?

自闭症患者 2021-02-03 16:23

So, some way or another (playing around), I found myself with a regex like \\d{1}{2}.

Logically, to me, it should mean:

(A digit exac

7 answers
  • 2021-02-03 17:03

    Scientific approach:
    Click on the patterns to see the example on regexplanet.com, then click on the green Java button.

    • You've already shown that \d{1}{2} matches "1" and doesn't match "12", so we know it isn't interpreted as (?:\d{1}){2}.
    • Still, 1 is a boring number, and {1} might be optimized away, so let's try something more interesting:
      \d{2}{3}. This still matches only two characters (not six); {3} is ignored.
    • OK. There's an easy way to see what a regex engine does. Does it capture?
      Let's try (\d{1})({2}). Oddly, this works. The second group, $2, captures the empty string.
    • So why do we need the first group? How about ({1})? Still works.
    • And just {1}? No problem there.
      It looks like Java is being a little weird here.
    • Great! So {1} is valid. We know Java expands * and + to {0,0x7FFFFFFF} and {1,0x7FFFFFFF}, so will * or + work? No:

      Dangling meta character '+' near index 0
      +
      ^

      The validation must come before * and + are expanded.

    I didn't find anything in the spec that explains this. It looks like a quantifier must come after at least a character, brackets, or parentheses.

    Most of these patterns are considered invalid by other regex flavors, and for good reason: they do not make sense.
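
    The experiments above can be reproduced locally with a small probe. This is only a sketch: QuantifierProbe and probe() are hypothetical names, and the results for the {m}{n} and bare {1} patterns depend on the JDK's Pattern implementation, so no output is claimed for them here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class for illustration; not part of any library.
class QuantifierProbe {
    // Reports how Pattern treats a regex against an input:
    // "match", "no match", or "error: <description>" if compilation fails.
    static String probe(String regex, String input) {
        try {
            return Pattern.compile(regex).matcher(input).matches() ? "match" : "no match";
        } catch (PatternSyntaxException e) {
            return "error: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"\\d{1}{2}", "1"},
            {"\\d{1}{2}", "12"},
            {"\\d{2}{3}", "12"},
            {"\\d{2}{3}", "123456"},
            {"(?:\\d{2}){3}", "123456"}, // explicit grouping, for comparison
            {"{1}", ""},
            {"+", ""},                   // dangling meta character
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " on \"" + c[1] + "\": " + probe(c[0], c[1]));
        }
    }
}
```

    Only two of these cases are dependable across implementations: the explicitly grouped (?:\d{2}){3} matches six digits, and a bare + is rejected with a "Dangling meta character" error.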

  • 2021-02-03 17:04

    At first I was surprised this doesn't throw a PatternSyntaxException.

    I can't base my answer on any facts, so this is just an educated guess:

    "\\d{1}"    // matches a single digit
    "\\d{1}{2}" // matches a single digit followed by two empty strings
    
  • 2021-02-03 17:07

    I have never seen the {m}{n} syntax anywhere. It seems that the regex engine on this Rubular page applies the {2} quantifier to the smallest possible token before it, which is \\d{1}. To mimic this in Java (or most other regex engines, it would seem), you need to group the \\d{1} like so:

    ^(\\d{1}){2}$
    

    See it in action here.
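
    A minimal sketch of the grouped form, using only java.util.regex (the class name GroupedRepeat and the helper lastRepetition() are made up for illustration). It also shows a side effect of quantifying a capturing group: group 1 keeps only the last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class GroupedRepeat {
    // With the parentheses, {2} applies to the whole group (\d{1}).
    static final Pattern TWO_DIGITS = Pattern.compile("^(\\d{1}){2}$");

    // Returns the text held by group 1 after a full match, or null if the input does not match.
    static String lastRepetition(String input) {
        Matcher m = TWO_DIGITS.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastRepetition("12")); // prints 2
        System.out.println(lastRepetition("1"));  // prints null
    }
}
```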

  • 2021-02-03 17:11

    I am guessing that the definition of {} is something like "look back to find a valid expression (excluding myself, {})", so in your example there is nothing between } and {.

    Anyway, if you wrap it in parentheses, it will work as you expected: http://refiddle.com/gv6.

  • 2021-02-03 17:17

    Compiled structure of the regex

    Kobi's answer is spot on about the behavior of Java regex (Sun/Oracle implementation) for the case "^\\d{1}{2}$", or "{1}".

    Below is the internal compiled structure of "^\\d{1}{2}$":

    ^\d{1}{2}$
    Begin. \A or default ^
    Curly. Greedy quantifier {1,1}
      Ctype. POSIX (US-ASCII): DIGIT
      Node. Accept match
    Curly. Greedy quantifier {2,2}
      Slice. (length=0)
    
      Node. Accept match
    Dollar(multiline=false). \Z or default $
    java.util.regex.Pattern$LastNode
    Node. Accept match
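
    If that structure is right, the pattern should behave like one in which the zero-length piece is written out explicitly as an empty group. The sketch below spells out that reading; EmptySliceReading is a hypothetical name, and the equivalence with ^\d{1}{2}$ is an assumption drawn from the dump above, not something the documentation states.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class EmptySliceReading {
    // \d{1} followed by an explicit empty group repeated exactly twice.
    static final Pattern EXPLICIT = Pattern.compile("^\\d{1}(?:){2}$");

    static boolean matches(String input) {
        return EXPLICIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("1"));  // prints true
        System.out.println(matches("12")); // prints false
    }
}
```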
    

    Looking at the source code

    From my investigation, the bug is probably due to the fact that { is not properly checked in the private method sequence().

    The method sequence() calls atom() to parse an atom, attaches a quantifier to that atom by calling closure(), and chains all atoms-with-closure together into one sequence.

    For example, given this regex:

    ^\d{4}a(bc|gh)+d*$
    

    Then the top-level call to sequence() will receive the compiled nodes for ^, \d{4}, a, (bc|gh)+, d*, $ and chain them together.

    With that idea in mind, let us look at the source code of sequence(), copied from OpenJDK 8-b132 (Oracle uses the same code base):

    @SuppressWarnings("fallthrough")
    /**
     * Parsing of sequences between alternations.
     */
    private Node sequence(Node end) {
        Node head = null;
        Node tail = null;
        Node node = null;
    LOOP:
        for (;;) {
            int ch = peek();
            switch (ch) {
            case '(':
                // Because group handles its own closure,
                // we need to treat it differently
                node = group0();
                // Check for comment or flag group
                if (node == null)
                    continue;
                if (head == null)
                    head = node;
                else
                    tail.next = node;
                // Double return: Tail was returned in root
                tail = root;
                continue;
            case '[':
                node = clazz(true);
                break;
            case '\\':
                ch = nextEscaped();
                if (ch == 'p' || ch == 'P') {
                    boolean oneLetter = true;
                    boolean comp = (ch == 'P');
                    ch = next(); // Consume { if present
                    if (ch != '{') {
                        unread();
                    } else {
                        oneLetter = false;
                    }
                    node = family(oneLetter, comp);
                } else {
                    unread();
                    node = atom();
                }
                break;
            case '^':
                next();
                if (has(MULTILINE)) {
                    if (has(UNIX_LINES))
                        node = new UnixCaret();
                    else
                        node = new Caret();
                } else {
                    node = new Begin();
                }
                break;
            case '$':
                next();
                if (has(UNIX_LINES))
                    node = new UnixDollar(has(MULTILINE));
                else
                    node = new Dollar(has(MULTILINE));
                break;
            case '.':
                next();
                if (has(DOTALL)) {
                    node = new All();
                } else {
                    if (has(UNIX_LINES))
                        node = new UnixDot();
                    else {
                        node = new Dot();
                    }
                }
                break;
            case '|':
            case ')':
                break LOOP;
            case ']': // Now interpreting dangling ] and } as literals
            case '}':
                node = atom();
                break;
            case '?':
            case '*':
            case '+':
                next();
                throw error("Dangling meta character '" + ((char)ch) + "'");
            case 0:
                if (cursor >= patternLength) {
                    break LOOP;
                }
                // Fall through
            default:
                node = atom();
                break;
            }
    
            node = closure(node);
    
            if (head == null) {
                head = tail = node;
            } else {
                tail.next = node;
                tail = node;
            }
        }
        if (head == null) {
            return end;
        }
        tail.next = end;
        root = tail;      //double return
        return head;
    }
    

    Take note of the line throw error("Dangling meta character '" + ((char)ch) + "'");. This is where the error is thrown if +, * or ? is dangling and is not part of a preceding token. As you can see, { is not among the cases that throw an error. In fact, it is not present in the list of cases in sequence() at all, so the compilation process goes through the default case directly to atom().

    @SuppressWarnings("fallthrough")
    /**
     * Parse and add a new Single or Slice.
     */
    private Node atom() {
        int first = 0;
        int prev = -1;
        boolean hasSupplementary = false;
        int ch = peek();
        for (;;) {
            switch (ch) {
            case '*':
            case '+':
            case '?':
            case '{':
                if (first > 1) {
                    cursor = prev;    // Unwind one character
                    first--;
                }
                break;
            // Irrelevant cases omitted
            // [...]
            }
            break;
        }
        if (first == 1) {
            return newSingle(buffer[0]);
        } else {
            return newSlice(buffer, first, hasSupplementary);
        }
    }
    

    When the process enters atom(), it encounters { right away, so it breaks out of both the switch and the for loop, and a new slice with length 0 is created (the length comes from first, which is 0).

    When this slice is returned, the quantifier is parsed by closure(), resulting in what we see.
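
    The interplay can be imitated with a toy parser. This is a heavily simplified model and not the JDK code: ToyRegexParser and its string "nodes" are hypothetical, it understands only literal characters and {n}, and it uses the literal a in place of \d.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the sequence()/atom()/closure() interplay; all names are hypothetical.
class ToyRegexParser {
    private final String pattern;
    private int cursor = 0;

    ToyRegexParser(String pattern) {
        this.pattern = pattern;
    }

    // Like sequence(): parse an atom, attach a quantifier to it, repeat until the end.
    List<String> sequence() {
        List<String> nodes = new ArrayList<>();
        while (cursor < pattern.length()) {
            nodes.add(closure(atom()));
        }
        return nodes;
    }

    // Like atom(): collect literal characters and stop at '{' without rejecting it.
    // If more than one character was collected before a '{', give the last one back
    // so that the quantifier binds to that character only.
    private String atom() {
        int start = cursor;
        while (cursor < pattern.length() && pattern.charAt(cursor) != '{') {
            cursor++;
        }
        int length = cursor - start;
        if (length > 1 && cursor < pattern.length()) {
            cursor--;
            length--;
        }
        return "Slice(length=" + length + ")";
    }

    // Like closure(): if a {n} follows, wrap the atom in a counted quantifier.
    private String closure(String atom) {
        if (cursor >= pattern.length() || pattern.charAt(cursor) != '{') {
            return atom;
        }
        int close = pattern.indexOf('}', cursor);
        if (close < 0) {
            throw new IllegalArgumentException("Unclosed counted closure");
        }
        int n = Integer.parseInt(pattern.substring(cursor + 1, close));
        cursor = close + 1;
        return "Curly{" + n + "," + n + "} of " + atom;
    }

    public static void main(String[] args) {
        // prints [Curly{1,1} of Slice(length=1), Curly{2,2} of Slice(length=0)]
        System.out.println(new ToyRegexParser("a{1}{2}").sequence());
    }
}
```

    Because atom() never treats a leading { as an error, the second quantifier ends up attached to an empty slice, which mirrors the compiled structure shown earlier.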

    Comparing the source code of Java 1.4.0, Java 5 and Java 8, there don't seem to be many changes in sequence() and atom(). It seems this bug has been there since the beginning.

    Standard for regular expression

    The top-voted answer citing IEEE-Standard 1003.1 (or POSIX standard) is irrelevant to the discussion, since Java does not implement BRE and ERE.

    Many syntax constructs have undefined behavior according to the standard but well-defined behavior across many other regex flavors (though whether those flavors agree with each other is another matter). For example, \d is undefined according to the standard, but it matches digits (ASCII/Unicode) in many regex flavors.
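
    In Java, for instance, \d is ASCII-only by default and covers Unicode digits once Pattern.UNICODE_CHARACTER_CLASS is set. A small sketch (DigitClassDemo and isDigit() are hypothetical names; the test character is U+0663, ARABIC-INDIC DIGIT THREE):

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class DigitClassDemo {
    // True if the whole string is a single character matched by \d under the given flags.
    static boolean isDigit(String s, int flags) {
        return Pattern.compile("\\d", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        String arabicIndicThree = "\u0663";
        System.out.println(isDigit("3", 0));                                              // prints true
        System.out.println(isDigit(arabicIndicThree, 0));                                 // prints false
        System.out.println(isDigit(arabicIndicThree, Pattern.UNICODE_CHARACTER_CLASS));   // prints true
    }
}
```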

    Sadly, there is no other standard on regular expression syntax.

    There is, however, a standard on Unicode regular expressions, which focuses on the features a Unicode regex engine should have. Java's Pattern class more or less implements Level 1 support as described in UTS #18: Unicode Regular Expressions, and RL2.1 (albeit in an extremely buggy way).

  • 2021-02-03 17:19

    IEEE-Standard 1003.1 says:

    The behavior of multiple adjacent duplication symbols ( '*' and intervals) produces undefined results.

    So every implementation can do as it pleases, just don't rely on anything specific...
