Non-greedy regular expression match for multicharacter delimiters in awk

后悔当初 2020-12-20 00:14

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?

3 Answers
  • 2020-12-20 00:51

    For general expressions, I'm using this as a non-greedy match:

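    # Like match(), but yields the shortest match at the leftmost position.
    # m and n are declared as extra parameters so they stay local.
    function smatch(s, r,    m, n) {
        if (!match(s, r))
            return 0
        m = RSTART
        do {
            n = RLENGTH
            # Retry on a prefix one character shorter; stop once no
            # match starts at m anymore (or the match is empty).
        } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        RSTART = m
        RLENGTH = n
        return RSTART
    }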
    

    smatch behaves like match, returning:

    the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.
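As a rough usage sketch, a function of this kind can be called on the sample string from the question. The wrapper name `smatch_demo` is made up for illustration, and the function body here is a defensive variant of the idea (local variables, a guard against empty matches), not necessarily the exact code above. Patterns anchored with `$` are assumed not to be used, since shortening the string changes what `$` matches.

```shell
# Hypothetical wrapper: runs a smatch-style function on the sample string.
smatch_demo() {
    awk '
    function smatch(s, r,    m, n) {   # m and n are locals
        if (!match(s, r))
            return 0
        m = RSTART
        do {
            n = RLENGTH
        } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        RSTART = m
        RLENGTH = n
        return RSTART
    }
    BEGIN {
        s = "AB 1 BA 2 AB 3 BA"
        if (smatch(s, "AB.*BA"))
            print RSTART, RLENGTH, "[" substr(s, RSTART, RLENGTH) "]"
    }'
}
smatch_demo
```

With the greedy pattern `AB.*BA`, this should report the first short span rather than the whole string.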

  • 2020-12-20 00:56

    The other answer doesn't really address the question of how to match non-greedily. It looks like it can't be done in (G)AWK. The manual says this:

    awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match.

    https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest

    The whole manual contains neither the word "greedy" nor "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So no, it can't be done.
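The leftmost-longest rule quoted from the manual can be seen on the string from the question. This is only a small sketch; the wrapper name `greedy_demo` is hypothetical.

```shell
# Hypothetical wrapper: shows that match() takes the longest span.
greedy_demo() {
    awk 'BEGIN {
        s = "AB 1 BA 2 AB 3 BA"
        # .* extends to the last BA, so the whole string is matched.
        if (match(s, /AB.*BA/))
            print RSTART, RLENGTH, "[" substr(s, RSTART, RLENGTH) "]"
    }'
}
greedy_demo
```

The match covers all 17 characters, which is why a plain `AB.*BA` cannot pick out the first "AB ... BA" pair on its own.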

  • 2020-12-20 01:00

    Merge your two negated character classes and remove the [^A] from the second alternation:

    regex = "AB([^AB]|B|[^B]A)*BA"
    

    This regex fails on the string ABABA, though - not sure if that is a problem.

    Explanation:

    AB       # Match AB
    (        # Group 1 (could also be non-capturing)
     [^AB]   # Match any character except A or B
    |        # or
     B       # Match B
    |        # or
     [^B]A   # Match any character except B, then A
    )*       # Repeat as needed
    BA       # Match BA
    

    Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
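One possible way to use this regex is to loop with `match()` and cut off the consumed part of the string after each hit. The sketch below assumes the delimiters are exactly the two-character strings AB and BA (hence the offsets 2 and 4); the function name `extract_all` is made up for illustration.

```shell
# Hypothetical helper: prints the content between each AB ... BA pair.
extract_all() {
    awk -v s="$1" 'BEGIN {
        regex = "AB([^AB]|B|[^B]A)*BA"
        while (match(s, regex)) {
            # Drop the 2-character delimiters on both sides.
            print "[" substr(s, RSTART + 2, RLENGTH - 4) "]"
            # Continue searching after the end of this match.
            s = substr(s, RSTART + RLENGTH)
        }
    }'
}
extract_all "AB 1 BA 2 AB 3 BA"
```

The brackets in the output only make the surrounding spaces visible. Note that `-v` interprets backslash escapes in the assigned value, so input containing backslashes would need different handling.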
