Balancing groups in variable-length lookbehind [duplicate]


陌清茗


2 Answers

一个人的身影

2020-11-27 07:47
I think I got it.
First, as I mentioned in one of the comments, `(?<=(?<A>.)(?<-A>.))` never matches.
But then I thought, what about `(?<=(?<-A>.)(?<A>.))`? It does match!
And how about `(?<=(?<A>.)(?<A>.))`? Matched against "12", A captures "1", and the Captures collection is {"2", "1"}: first "2", then "1", so the order is reversed.
So, inside a lookbehind, .NET matches and captures from right to left.

Now, how can we make it capture from left to right? This is quite simple, really - we can trick the engine using a lookahead:
```
(?<=(?=(?<A>.)(?<A>.))..)
```
Applied to your original pattern, the simplest option I came up with was:
```
(?<=
    ~[(]
    (?=
        (?:
            [^()]
            |
            (?<Depth>[(])
            |
            (?<-Depth>[)])
        )*
        (?<=(\k<Prefix>))   # Make sure we matched until the current position
    )
    (?<Prefix>.*)           # This is captured BEFORE getting to the lookahead
)
[a-z]
```
The challenge here was that the balanced part may now end anywhere, so we force it to reach all the way to the current position (an anchor for the position where the lookbehind started would be useful here; .NET has `\G` and `\Z`, but as far as I can tell neither marks that position).
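To make the intent of that pattern concrete, here is a minimal Python sketch of the same left-to-right check written as a plain loop. It is not the .NET regex engine, and the names `reachable_from_tilde` and `letters_after_tilde` are made up for illustration. It asks whether some `~(` lies before a position such that scanning forward from it never meets a `)` without a matching `(`.

```python
def reachable_from_tilde(text: str, pos: int) -> bool:
    """Hypothetical helper: True if some "~(" before pos can be scanned
    forward up to pos without closing more parentheses than were opened."""
    start = text.find("~(")
    while start != -1 and start + 2 <= pos:
        depth = 0          # plays the role of the Depth stack
        balanced = True
        for ch in text[start + 2:pos]:
            if ch == "(":
                depth += 1             # like (?<Depth>[(])
            elif ch == ")":
                if depth == 0:         # like (?<-Depth>[)]) failing on an empty stack
                    balanced = False
                    break
                depth -= 1
        if balanced:
            return True
        # Try the next "~(" further to the right, if there is one.
        start = text.find("~(", start + 1)
    return False


def letters_after_tilde(text: str) -> str:
    """Collect the lowercase ASCII letters ([a-z]) that pass the check."""
    return "".join(
        ch for i, ch in enumerate(text)
        if "a" <= ch <= "z" and reachable_from_tilde(text, i)
    )
```

For the string `~(a b (c) d (e f (g) h) i) j k`, `letters_after_tilde` returns `"abcdefghi"`: the scan towards `j` or `k` hits the `)` that closes the outer group while the counter is at zero, so those letters are rejected.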

It is very possible this behavior is documented somewhere; I'll try to look it up.

Here's another approach. The idea is simple - .NET wants to match from right to left? Fine! Take that:
(tip: start reading from the bottom - that is how .NET does it)
```
(?<=
    (?(Depth)(?!))  # 4. Finally, make sure there are no extra closed parentheses.
    ~\(
    (?>                     # (non backtracking)
        [^()]               # 3. Allow any other character
        |
        \( (?<-Depth>)?     # 2. When seeing an open paren, decrease depth.
                            #    Also allow excess parentheses: '~((((((a' is OK.
        |
        (?<Depth>  \) )     # 1. When seeing a closed paren, add to depth.
    )*
)
\w                          # Match your letter
```
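The four numbered steps can also be read as an ordinary right-to-left scan. The Python sketch below approximates them with a counter instead of the `Depth` stack. It is not the .NET engine, the function names are invented for illustration, and `ch.isalnum() or ch == "_"` is only a rough stand-in for `\w`.

```python
def inside_tilde_group(text: str, pos: int) -> bool:
    """Hypothetical helper: scan leftwards from pos, as the lookbehind does."""
    depth = 0   # ')' characters seen so far that still lack a matching '('
    j = pos     # text[j:pos] has already been scanned
    while True:
        # Step 4: a "~(" immediately to the left counts only if depth is zero.
        if j >= 2 and text[j - 2:j] == "~(" and depth == 0:
            return True
        if j == 0:
            return False
        ch = text[j - 1]
        if ch == ")":
            depth += 1      # step 1: a closing paren adds to depth
        elif ch == "(" and depth > 0:
            depth -= 1      # step 2: an opening paren removes one from depth;
                            # excess opening parens leave depth at zero
        # Step 3: any other character changes nothing.
        j -= 1              # move one character to the left


def word_chars_in_group(text: str) -> str:
    """Collect the word characters for which the leftward scan succeeds."""
    return "".join(
        ch for i, ch in enumerate(text)
        if (ch.isalnum() or ch == "_") and inside_tilde_group(text, i)
    )
```

On `~((((((a` the scan accepts `a`, matching the comment in step 2, and on `~(a b (c) d (e f (g) h) i) j k` it accepts `a` through `i` but not `j` or `k`, because the counter is still at one when the scan reaches `~(`.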

再見小時候

2020-11-27 07:55

I think the problem is with the data and not the pattern. The data has 'Post' items which need to be matched such as

(a b ( c ) d e f )

where d, e, and f need to be matched. More balanced data would be

(a b (c)(d)(e)(f))

So the tack I took on this example data required a post match situation after braces:

~(a b (c) d (e f (g) h) i) j k

where j & k should be ignored...my pattern failed and captured them.

The interesting thing is that I named the capture groups to find out where they came in, and j and k came in via the third one (Char3). I leave you with, not an answer, but an attempt to see if you can improve on it.

```
(~                         # Anchor to a tilde
 (                         # Note that \x28 is ( and \x29 is )
  (                          # --- PRE ---
     (?<Paren>\x28)+          # Push a match onto Paren
     ((?<Char1>[^\x28\x29])(?:\s?))*
   )+                         # Represents Sub Group 1
  (                           #---- Closing
   ((?<Char2>[^\x28\x29])(?:\s?))*
   (?<-Paren>\x29)+           # Pop a match off Paren

  )+
  (
     ((?<Char3>[^\x28\x29])(?:\s?))*   # Post match possibilities
  )+

 )+
(?(Paren)(?!))    # Fail if any parentheses are still unmatched
)
```

Here is the match broken out with a tool I have created on my own (maybe one day I will publish). Note that the ˽ shows where a space was matched.

```
Match #0
               [0]:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       ["1"] → [1]:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       →1 Captures:  ~(a˽b˽(c)˽d˽(e˽f˽(g)˽h)˽i)˽j˽k
       ["2"] → [2]:  (e˽f˽(g)˽h)˽i)˽j˽k
       →2 Captures:  (a˽b˽(c)˽d˽, (e˽f˽(g)˽h)˽i)˽j˽k
       ["3"] → [3]:  (g
       →3 Captures:  (a˽b˽, (c, (e˽f˽, (g
       ["4"] → [4]:  g
       →4 Captures:  a˽, b˽, c, e˽, f˽, g
       ["5"] → [5]:  ˽i)
       →5 Captures:  ), ), ˽h), ˽i)
       ["6"] → [6]:  i
       →6 Captures:  ˽, h, ˽, i
       ["7"] → [7]:  
       →7 Captures:  ˽d˽, , ˽j˽k, 
       ["8"] → [8]:  k
       →8 Captures:  ˽, d˽, ˽, j˽, k
   ["Paren"] → [9]:  
  ["Char1"] → [10]:  g
      →10 Captures:  a, b, c, e, f, g
  ["Char2"] → [11]:  i
      →11 Captures:  ˽, h, ˽, i
  ["Char3"] → [12]:  k
      →12 Captures:  ˽, d, ˽, j, k
```
