lookahead in the middle of regex doesn't match

孤街浪徒 2021-01-15 15:27

I have a string $s1 = "a_b"; and I want to match this string but only capture the letters. I tried to use a lookahead:

if($s1 =~ /([a-z])(?=_)         


        
1 answer
  • 2021-01-15 16:08

    A lookahead checks the characters immediately after the current position without consuming them. When the assertion succeeds, the engine backtracks to the end of the previous match (right after a) and continues matching from there. Your regex works only if you put a literal _ after the positive lookahead: ([a-z])(?=_)_([a-z])
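    The difference can be checked with a minimal sketch. It assumes the truncated pattern in the question continued with a second ([a-z]) group; capture_letters is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured groups, or an empty list on failure.
sub capture_letters {
    my ($subject, $regex) = @_;
    my @captures = $subject =~ $regex;   # list context yields the capture groups
    return @captures;
}

my $s1 = "a_b";

# Lookahead only (assumed shape of the question's pattern): after asserting "_"
# the engine is still positioned before "_", so the second [a-z] is tried
# against "_" and fails.
my @without = capture_letters($s1, qr/([a-z])(?=_)([a-z])/);

# Lookahead followed by a literal "_" that actually consumes the underscore.
my @with = capture_letters($s1, qr/([a-z])(?=_)_([a-z])/);

print "lookahead only:      ", (@without ? "@without" : "no match"), "\n";
print "lookahead plus '_':  ", (@with    ? "@with"    : "no match"), "\n";
```

    The first pattern reports no match and the second captures a and b, because only the literal _ moves the match position past the underscore.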

    You don't even need a lookahead (or a non-capturing group) for this match:

    if ($s1 =~ /([a-z])_([a-z])/) { print "Captured: $1, $2\n"; }
    

    Edit

    In reply to @Borodin's comment

    I think that moving backwards is the same as a backtrack, which is easier to recognize by debugging the whole match (Perl's regex debug mode):

    Matching REx "a(?=_)_b" against "a_b"
    .
    .
    .
       0 <> <a_b>                |   0| 1:EXACT <a>(3)
       1 <a> <_b>                |   0| 3:IFMATCH[0](9)
       1 <a> <_b>                |   1|  5:EXACT <_>(7)
       2 <a_> <b>                |   1|  7:SUCCEED(0)
                                 |   1|  subpattern success...
       1 <a> <_b>                |   0| 9:EXACT <_b>(11)
       3 <a_b> <>                |   0| 11:END(0)
    Match successful!
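
    A trace like the one above can be produced with the re pragma. This is a minimal sketch: match_with_trace is a hypothetical name, the trace is written to STDERR, and its exact layout varies between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: runs the match from the trace with regex debugging on.
sub match_with_trace {
    my ($subject) = @_;
    use re 'debug';   # lexically scoped: only regexes in this sub are traced
    return $subject =~ /a(?=_)_b/ ? 1 : 0;
}

print match_with_trace("a_b") ? "Match successful\n" : "Match failed\n";
```

    Keeping the pragma inside the sub limits the compile and execution traces to this one pattern.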
    

    As the debug output above shows, at the fourth line of the results (the third step) the engine has consumed the characters a_ while inside the lookahead assertion. After the positive lookahead succeeds, a backtrack happens: the engine leaves the whole sub-pattern in reverse and resumes at the position right after a.
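
    That the match position ends up right after a, whatever the lookahead inspected, can also be checked with pos(). A small sketch, where offset_after is a hypothetical helper:

```perl
use strict;
use warnings;

# Hypothetical helper: offset where the match ended after one /g match,
# or undef if the pattern did not match.
sub offset_after {
    my ($subject, $regex) = @_;
    return $subject =~ /$regex/g ? pos($subject) : undef;
}

# The lookahead inspects "_" but does not consume it.
print "a(?=_) ends at offset ", offset_after("a_b", qr/a(?=_)/), "\n";

# A literal "_" is consumed, so the offset moves one character further.
print "a_     ends at offset ", offset_after("a_b", qr/a_/), "\n";
```

    The first offset is 1 and the second is 2, which is the position difference the trace shows between the lookahead and the literal _.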

    At line 5 of the debug output, the engine has consumed only one character: a. The Regex101 debugger reports the same thing.

    How I interpret this backtrack is clearer in this illustration (thanks to @JDB, whose style of representation I borrowed):

    a(?=_)_b
    *
    |\
    | \
    |  : a (match)
    |  * (?=_)
    |  |↖
    |  | ↖
    |  |↘ ↖
    |  | ↘ ↖
    |  |  ↘ ↖
    |  |   : _ (match)
    |  |     ^ SUBPATTERN SUCCESS (OP_ASSERT :=> MATCH_MATCH)
    |  * _b
    |  |\
    |  | \
    |  |  : _ (match)
    |  |  : b (match)
    |  | /
    |  |/
    | /
    |/
    MATCHED
    

    By this I mean that when the lookahead assertion succeeds, since part of the input string has already been examined, the engine goes back up to the previous match offset (eptr, the pointer into the subject, is not changed, but the offset is). It resets the characters consumed inside the assertion and tries to continue matching from there, and that is what I call a backtrack. The steps taken by the engine can also be visualized with Regexp::Debugger.

    So I see it as a backtrack, or something like one. If I'm wrong about any of this, I'd welcome corrections with open arms.
