How does the ? make a quantifier lazy in regex

忘了有多久 2020-12-06 07:56

I've been looking into regex lately and figured that the ? operator makes the *, +, or ? lazy. My question is how does it

4 Answers
  • 2020-12-06 08:38

    This very much depends on the implementation, I guess. But since every quantifier I am aware of can be modified with ? it might be reasonable to implement it that way.

  • 2020-12-06 08:44

    I think a little history will make it easier to understand. When Larry Wall wanted to grow regex syntax to support new features, his options were severely limited. He couldn't just decree (for example) that % is now a metacharacter that supports new feature "XYZ". That would break the millions of existing regexes that happened to use % to match a literal percent sign.

    What he could do is take an already-defined metacharacter and use it in such a way that its original function wouldn't make sense. For example, any regex that contained two quantifiers in a row would be invalid, so it was safe to say a ? after another quantifier now turns it into a reluctant quantifier (a much better name than "lazy" IMO; "non-greedy" is good too). So the answer to your question is that the ? doesn't modify the *; rather, *? is a single entity: a reluctant quantifier. The same is true of the + in possessive quantifiers (*+, {0,2}+ etc.).

    A similar process occurred with group syntax. It would never make sense to have a quantifier after an unescaped opening parenthesis, so it was safe to say (? now marks the beginning of a special group construct. But the question mark alone would only support one new feature, so the ? itself has to be followed by at least one more character to indicate which kind of group it is ((?:...), (?<!...), etc.). Again, the (?: is a single entity: the opening delimiter of a non-capturing group.

    I don't know offhand why he used the question mark both times. I do know Perl 6 Rules (a bottom-up rewrite of Perl 5 regexes) has done away with all that crap and uses an infinitely more sensible syntax.

  • 2020-12-06 08:52

    Imagine you have the following text:

    BAAAAAAAAD
    

    The following regexes will match:

    /B(A+)/ => 'BAAAAAAAA'
    /B(A+?)/ => 'BA'
    /B(A*)/ => 'BAAAAAAAA'
    /B(A*?)/ => 'B'
    

    Adding "?" to the + and * operators makes them "lazy", i.e. they will match the absolute minimum required for the overall expression to match. By default the * and + operators are "greedy" and try to match AS MUCH AS POSSIBLE while still letting the overall expression match.

    Remember + means "one or more" so the minimum will be "one if possible, more if absolutely necessary" whereas the maximum will be "all if possible, one if absolutely necessary".

    And * means "zero or more" so the minimum will be "nothing if possible, more if absolutely necessary" whereas the maximum will be "all if possible, zero if absolutely necessary".
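
    The four results above, plus the "more if absolutely necessary" case, can be checked with a short sketch. Python's re module is used here purely for illustration (the thread itself is not tied to one language), and the helper name matched is made up for this example:

```python
import re

text = "BAAAAAAAAD"

# Hypothetical helper for this example: return the whole match (group 0).
def matched(pattern: str) -> str:
    return re.search(pattern, text).group(0)

greedy_plus = matched(r"B(A+)")   # greedy: takes every A it can
lazy_plus = matched(r"B(A+?)")    # lazy: stops after the one required A
greedy_star = matched(r"B(A*)")   # greedy: takes every A it can
lazy_star = matched(r"B(A*?)")    # lazy: zero A's already satisfies A*?

# "More if absolutely necessary": once a D must follow, the lazy
# quantifier is forced to expand until the D can match.
forced = re.search(r"B(A+?)D", text).group(1)

print(greedy_plus, lazy_plus, greedy_star, lazy_star, forced)
```

    Note that in the last pattern the lazy A+? ends up capturing every A, because nothing shorter lets the rest of the pattern succeed.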

  • 2020-12-06 08:54

    ? can mean a lot of different things in different contexts.

    • Following a normal regex token (a character, a shorthand, a character class, a group...), it means "Match the previous item 0-1 times".
    • Following a quantifier like ?, *, +, {n,m}, it takes on a different meaning: "Make the previous quantifier lazy instead of greedy" (if greedy is the default; that can be changed, though. For example in PHP, the /U modifier makes all quantifiers lazy by default, so the additional ? makes them greedy).
    • Right after an opening parenthesis, it marks the start of a special construct such as:

      a) (?s): mode modifiers ("turn on dotall mode")
      b) (?:...): make the group non-capturing
      c) (?=...) or (?!...): lookahead assertion
      d) (?<=...) or (?<!...): lookbehind assertion
      e) (?>...): atomic group
      f) (?<foo>...): named capturing group
      g) (?#comment): inline comments, ignored by the regex engine
      h) (?(?=if)then|else): conditionals

    and others. Not all constructs are available in all regex flavors.

    • Within a character class ([?]), it simply matches a verbatim ?.
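
    A rough illustration of these four contexts, again assuming Python's re module (the sample strings are invented for this sketch). Python happens to spell named groups (?P<name>...) rather than (?<name>...), which is one instance of constructs differing between flavors:

```python
import re

# 1. After a normal token: "match the previous item 0-1 times".
optional_both = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour")]

# 2. After a quantifier: switch it from greedy to lazy.
greedy_tag = re.search(r"<.+>", "<a><b>").group(0)   # runs to the last >
lazy_tag = re.search(r"<.+?>", "<a><b>").group(0)    # stops at the first >

# 3. Right after an opening parenthesis: special group constructs.
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups()   # only (c) captures
lookahead = re.search(r"foo(?=bar)", "foobar").group(0)      # bar is not consumed
named = re.search(r"(?P<year>\d{4})", "posted in 2020").group("year")

# 4. Inside a character class: a literal question mark.
literal_marks = re.findall(r"[?]", "why? how?")

print(optional_both, greedy_tag, lazy_tag, non_capturing, lookahead, named, literal_marks)
```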