Why/how is an additional variable needed in matching a repeated arbitrary character with capture groups?

故里飘歌 2020-12-20 13:50

I'm matching a sequence of a repeating arbitrary character, with a minimum length, using a perl6 regex.

After reading through https://docs.perl6.org/language/regex

3 Answers
  • 2020-12-20 13:50

    Perl 6 regexes scale up to full grammars, which produce parse trees. Those parse trees are a tree of Match objects. Each capture - named or positional - is either a Match object or, if quantified, an array of Match objects.

    This is in general good, but does involve making the trade-off you have observed: once you are on the inside of a nested capturing element, then you are populating a new Match object, with its own set of positional and named captures. For example, if we do:

    say "abab" ~~ /((a)(b))+/
    

    Then the result is:

    「abab」
     0 => 「ab」
      0 => 「a」
      1 => 「b」
     0 => 「ab」
      0 => 「a」
      1 => 「b」
    

    And we can then index:

    say $0;        # The array of the top-level capture, which was quantified
    say $0[1];     # The second Match
    say $0[1][0];  # The first Match within that Match object (the (a))
    

    It is a departure from regex tradition, but also an important part of scaling up to larger parsing challenges.
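    The indexing above can be checked programmatically. This is a minimal sketch of the same "abab" match; the variable names ($match, @repeats, $second) are illustrative and not part of the answer.

```raku
# Same pattern as above, with the result kept in a named variable.
my $match = "abab" ~~ / ( (a) (b) )+ /;

# $match[0] is what $0 refers to. Because the capture is quantified,
# it is an Array of Match objects, one per repetition.
my @repeats = $match[0].list;
say @repeats.elems;   # 2

# Each repetition is its own Match with its own positional captures.
my $second = @repeats[1];
say ~$second;         # ab
say ~$second[0];      # a  (the inner (a))
say ~$second[1];      # b  (the inner (b))
```

    Each level of parentheses adds one level of indexing, which is why $0[1][0] reaches the (a) inside the second repetition.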

  • 2020-12-20 13:56

    Option #1: Don't sub-capture a pattern that includes a back reference

    $0 is a back reference. [1]

    If you omit the sub-capture around the expression containing $0, then the code works:

    $_="bbaaaaawer"; / (.) $0**2..* / && print $/; # aaaaa
    

    Then you can also omit the {}. (I'll return to why you sometimes need to insert a {} later in this answer.)


    But perhaps you wrote a sub-capture around the expression containing the back reference because you thought you needed the sub-capture for some other later processing.

    There are often other ways to do things. In your example, perhaps you wanted a way to be able to count the number of repeats. If so, you could instead write:

    $_="bbaaaaawer";
    / (.) $0**2..* /;
    print $/.chars div $0.chars; # 5
    

    Job done, without the complications of the following sections.

    Option #2: Sub-capture without changing the current match object during matching of the pattern that includes a back reference

    Maybe you really need to sub-capture a match of an expression that includes a back reference.

    This can still be done without surrounding the $0 with a sub-capture, which avoids the problems discussed in the third section below.

    You can use this technique if you don't need to have sub-sub-captures of the expression and the expression isn't too complicated:

    $_="bbaaaaawer";
    / (.) $<capture-when-done>=$0**2..* /;
    print $<capture-when-done>.join; # aaaa
    

    This sub-captures the result of matching the expression in a named capture but avoids inserting an additional sub-capture context around the expression (which is what causes the complications discussed in the next section).

    Unfortunately, while this technique will work for the expression in your question ($0**2..*), it won't if an expression is complex enough to need grouping. This is because the syntax $<foo>=[...] doesn't work. Perhaps this is fixable without hurting performance or causing other problems. [2]

    Option #3: Use a saved back reference inside a sub-capture

    Finally we arrive at the technique you've used in your question.

    Automatically available back references to sub-captures (like $0) cannot refer to sub-captures that happened outside the sub-capture they're written in. (Update: see the "I'm (at least half) wrong!" note below.)

    So if, for any reason, you have to create a sub-capture (using either (...) or <...>) then you must manually store a back reference in a variable and use that instead.

    Before we get to a final section explaining in detail why you must use a variable, let's first complete an initial answer to your question by covering the final wrinkle.

    {} forces "publication" of match results thus far

    The {} is necessary to force the :my $c=$0; to update each time it's reached using the current regex/grammar engine. If you don't write it, then the regex engine fails to update $c to a capture of 'a' and instead leaves it stuck on a capture of 'b'.

    Please read "Publication" of match variables by Rakudo.
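    The pieces described so far (a saved back reference plus {}) can be put together as follows. This is a minimal sketch of the technique as this answer describes it, not a quotation of the question's code; the names $input, $saved and $c are assumptions made for illustration.

```raku
my $input = "bbaaaaawer";

# (.)            captures one character as $0 of the top-level match
# {}             empty code block; forces publication of the match so far
# :my $c = $0;   saves that capture in a lexical regex variable
# ( $c ** 2..* ) opens a new capture level; $0 would no longer refer to
#                the outer (.) in here, but the saved $c still does
my $saved = $input ~~ / (.) {} :my $c = $0; ( $c ** 2..* ) /;

say ~$saved;      # the whole match
say ~$saved[0];   # the single character captured by (.)
say ~$saved[1];   # the run of repeats captured by the second group
```

    Without the {} the declaration may see a stale $0, which is the "stuck on 'b'" behaviour described above.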

    Why can't a sub-capture include a back reference to captures that happened outside that sub-capture?

    First, you have to take into account that matching in P6 is optimized for the nested matching case syntactically, semantically, and implementation wise.

    In particular, if, when writing a regex or grammar, you write a numbered capture (with (...)), or a named rule/capture (with <foo>), then you've inserted a new level in a tree of sub-patterns that are dynamically matched/captured at run-time.

    See jnthn's answer for why and Brad's for some discussion of details.


    What I'll add to those answers is a (rough!) analogy, and another discussion of why you have to use a variable and {}.

    The analogy begins with a tree of sub-directories in a file system:

    /
      a
      b
        c
        d
    

    The analogy is such that:

    • The directory structure above corresponds to the result of a completed match operation.

    • After an overall match or grammar parse is complete, the match object $/ refers (analogously speaking) to the root directory. [3]

    • The sub-directories correspond to sub-captures of the match.

    • The numbered sub-matches/sub-captures $0 and $1 at the top level of the match operation shown below these bullet points correspond to sub-directories a and b. The numbered sub-captures of the top-level $1 sub-match/sub-capture correspond to the c and d sub-directories.

    • During matching $/ refers to the "current match object" which corresponds to the "current working directory".

    • It's easy to refer to a sub-capture (sub-directory) of the current match (current working directory).

    • It's impossible to refer to a sub-capture (sub-directory) outside the current match (current working directory) unless you've saved a reference to that outside directory (capture) or a parent of it. That is, P6 does not include an analog of .. or /! (Update: I'm happy to report that I'm (at least half) wrong! See "What's the difference between $/ and $¢ in regex?".)

    If file system navigation didn't support these back references towards the root then one thing to do would be to create an environment variable that stored a particular path. That's roughly what saving a capture in a variable in a P6 regex is doing.

    The central issue is that a lot of the machinery related to regexes is relative to "the current match". This includes $/, which refers to the current match, and back references like $0, which are relative to the current match. (Update: see the "I'm (at least half) wrong!" note above.)


    Thus, in the following, which is runnable via tio.run here, it's easy to display 'bc' or 'c' with a code block inserted in the third pair of parens...

    $_="abcd";
    m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # 「bc」␤ 0 => 「c」␤
    say $/;                                      # 「abcd」␤ etc.
    

    ...but it's impossible to refer to the captured 「a」 in that third pair of parens without storing 「a」's capture in a regular variable. (Update: see the "I'm (at least half) wrong!" note above.)

    Here's one way of looking at the above match:

      ↓ Start TOP level $/
    m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # captures 「abcd」
    
        ↓ Start first sub-capture; TOP's $/[0]
       (                                    )    # captures 「abcd」
    
          ↓ Start first sub-sub-capture; TOP's $/[0][0]
         ( . )                                   # captures 「a」
    
                ↓ Start *second* sub-sub-capture; TOP's $/[0][1]
               (                          )      # captures 「bcd」
    
                    ↓ Start sub-sub-sub-capture; TOP's $/[0][1][0]
                   ( . )                         # captures 「c」
    
                         { say $/ }              # 「bc」␤ 0 => 「c」␤
    
                                     ( . )       # captures 'd'
    

    If we focus for a moment on what $/ refers to outside of the regex (and also directly inside the /.../ regex, but not inside sub-captures), then that $/ refers to the overall Match object, which ends up capturing 「abcd」. (In the filesystem analogy this particular $/ is the root directory.)

    The $/ inside the code block inside the second sub-sub-capture refers to a lower level match object, specifically the one that, at the point the say $/ is executed, has already matched 「bc」 and will go on to have captured 「bcd」 by the end of the overall match.

    But there's no built-in way to refer to the sub-capture of 'a', or the overall capture (which at that point would be 'abc'), from within the sub-capture surrounding the code block. (Update: see the "I'm (at least half) wrong!" note above.)

    Hence you have to do something like what you've done.

    A possible improvement?

    What if there were a direct analog in P6 regexes for specifying the root? (Update: see the "I'm (at least half) wrong!" note above.)

    Here's an initial cut at this that might make sense. Let's define a grammar:

    my $*TOP;
    grammar g {
      token TOP { { $*TOP := $/ } (.) {} <foo> }
      token foo { <{$*TOP[0]}> }
    }
    say g.parse: 'aa' # 「aa」␤ 0 => 「a」␤ foo => 「a」
    

    So, perhaps a new variable could be introduced, one that's read-only for userland code, that's bound to the overall match object during a match operation. (Update: see the "I'm (at least half) wrong!" note above.)

    But then that's not only pretty ugly (unable to use a convenient short-hand back reference like $0) but refocuses attention on the need to also insert a {}. And given that it would presumably be absurdly expensive to republish all the tree of match objects after each atom, one is brought full circle back to the current status quo. Short of the fixes mentioned in this answer, I think what is currently implemented is as good as it's likely to get.

    Footnotes

    [1] The current P6 doc doesn't use the conventional regex term "back reference" but $0, $1 etc. are numbered P6 back references. The simplest explanation I've seen of numbered back references is this SO about them using a different regex dialect. In P6 they start with $ instead of \ and are numbered starting from 0 rather than 1. The equivalent of \0 in other regex dialects is $/ in P6. In addition, $0 is an alias for $/[0], $1 for $/[1], etc.

    [2] One might think this would work, but it doesn't:

    $_="bbaaaaawer";
    / (.) $<doesn't-work>=[$0**2..*] /;
    print $<doesn't-work>.join; # Use of Nil in string context
    

    It seems that [...] doesn't mean "group, but don't insert a new capture level like (...) and <...> do" but instead "group, and do not capture". This renders the $<doesn't-work> in $<doesn't-work>=[$0**2..*] meaningless. Perhaps this can reasonably be fixed and perhaps it should be fixed.

    [3] The current "match variable" doc says:

    $/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match.

    (Fwiw $/ contains a List of Match objects if an adverb like :global or :exhaustive is used.)

    The above description ignores a very important use case for $/ which is its use during matching, in which case it contains the results so far of the current regex.

    Following our file system analogy, $/ is like the current working directory -- let's call it "the current working match object" aka CWMO. Outside a matching operation the CWMO ($/) is ordinarily the completed result of the last regex match or grammar parse. (I say "ordinarily" because it's writable so code can change it with as little as $/ = 42.) During matching (or actions) operations the CWMO is read-only for userland code and is bound to a Match object generated by the regex/grammar engine for the current match or action rule/method.

  • 2020-12-20 14:16

    The reason you have to store the capture into something other than $0 is that every capturing () creates a new set of numbered captures.

    So the $0 inside of ($0) can never refer to anything, because you didn't set $0 inside of the ().

    (The named captures $<foo> are also affected by this.)


    The following has 3 separate $0 “variables”, and one $1 “variable”:

    'aabbaabb' ~~ / ^ ( (.)$0 ((.)$0) ) $0 $ /
    
    'aabbaabb' ~~ /
                    ^
    
                    # $0 = 'aabb'
                    (
    
                      # $0 = 'a'
                      (.) $0
    
                      # $1 = 'bb'
                      (
    
                        # $0 = 'b'
                        (.) $0
                      )
                    )
    
                    $0
    
                    $
                  /
    
    「aabbaabb」
     0 => 「aabb」
      0 => 「a」
      1 => 「bb」
       0 => 「b」
    

    Basically the () in the regex DSL act a bit like {} in normal Perl6.

    A fairly direct if simplified translation of the above regex to “regular” Perl6 code follows.
    (Pay attention to the 3 lines with my $/ = [];)
    (Also the / ^ / style comments refer to the regex code for ^ and such above)

    given 'aabbaabb' {
        my $/ = [];      # give assignable storage for $0, $1 etc.
        my $pos = 0;     # position counter
        my $init = $pos; # initial position
    
        # / ^ /
        fail unless $pos == 0;
    
        # / ( /
        $0 = do {
            my $/ = [];
            my $init = $pos;
    
            # / (.) $0 /
            $0 = .substr($pos,1); # / (.) /
            $pos += $0.chars;
            fail unless .substr($pos,$0.chars) eq $0; # / $0 /
            $pos += $0.chars;
    
            # / ( /
            $1 = do {
                my $/ = [];
                my $init = $pos;
    
                # / (.) $0 /
                $0 = .substr($pos,1); # / (.) /
                $pos += $0.chars;
                fail unless .substr($pos,$0.chars) eq $0; # / $0 /
                $pos += $0.chars;
    
            # / ) /
                # the returned value (becomes $1 in outer scope)
                .substr($init, $pos - $init)
            }
    
        # / ) /
            # the returned value (becomes $0 in outer scope)
            .substr($init, $pos - $init)
        }
    
        # / $0 /
        fail unless .substr($pos,$0.chars) eq $0;
        $pos += $0.chars;
    
        # / $ /
        fail unless $pos == .chars;
    
        # the returned value
        .substr($init, $pos - $init)
    }
    

    TL;DR

    Just remove the () surrounding ($c) / ($0).
    (Assuming you didn't need the capture for something else.)

    /((.) $0**2..*)/
    
    perl6 -e '$_="bbaaaaawer"; /((.) $0**2..*)/ && put $0';
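    To see why this form works, note that (.) and $0 now sit inside the same pair of parentheses, so they share one set of numbered captures. This is a minimal sketch with an illustrative variable name ($run) that is not part of the answer:

```raku
my $run = "bbaaaaawer" ~~ / ( (.) $0 ** 2..* ) /;

# The outer group is $0 of the overall match: the whole run of repeats.
say ~$run[0];       # aaaaa

# The inner (.) is $0 *within* the outer group, which is what the
# back reference inside the parentheses referred to.
say ~$run[0][0];    # a

# The repeat count can still be recovered from the two lengths.
say $run[0].chars div $run[0][0].chars;   # 5
```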
    