Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
For general expressions, I'm using this as a non-greedy match:
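    # m and n are extra parameters so that they stay local to the function.
    function smatch(s, r,    m, n) {
        if (match(s, r)) {
            m = RSTART
            # Shrink the candidate while a shorter match still starts at m.
            # n > 0 stops the loop on an empty match; RSTART == 1 rejects
            # a match that begins later inside the shortened substring.
            do {
                n = RLENGTH
            } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
            RSTART = m
            RLENGTH = n
            return RSTART
        } else return 0
    }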
smatch behaves like match, returning the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.
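As a rough usage sketch, the function can be called in a loop to pull every shortest match out of a line. The shell wrapper name shortest_matches and the loop variable rest are made up for this illustration. The copy of smatch inside declares m and n as locals and guards the loop against empty and non-initial matches.

```shell
# Hypothetical wrapper: print every shortest "AB...BA" match on each input line.
shortest_matches() {
  awk '
    # smatch: leftmost match of r in s, shrunk to the shortest length.
    function smatch(s, r,    m, n) {
      if (match(s, r)) {
        m = RSTART
        do {
          n = RLENGTH
        } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        RSTART = m
        RLENGTH = n
        return RSTART
      } else return 0
    }
    {
      rest = $0
      # Report a match, then continue searching after it.
      while (smatch(rest, "AB.*BA")) {
        print substr(rest, RSTART, RLENGTH)
        rest = substr(rest, RSTART + RLENGTH)
      }
    }
  '
}

printf '%s\n' 'AB 1 BA 2 AB 3 BA' | shortest_matches
```

For the string from the question this should print AB 1 BA and AB 3 BA on separate lines. Note that a regex which can match the empty string would make the outer loop spin, so the sketch assumes a pattern with at least one required character.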
The other answer didn't really address the question: how do you match non-greedily? It looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
And the whole manual contains neither the word "greedy" nor "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So… no, can't be done.
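The leftmost-longest rule is easy to see on the string from the question. This is only a small demonstration; the wrapper name longest_match is invented for the example.

```shell
# Hypothetical helper: print what a plain match() finds for AB.*BA.
longest_match() {
  awk '{
    if (match($0, /AB.*BA/)) {
      print substr($0, RSTART, RLENGTH)
    }
  }'
}

printf '%s\n' 'AB 1 BA 2 AB 3 BA' | longest_match
# prints: AB 1 BA 2 AB 3 BA
```

The match runs from the first AB to the last BA, which is exactly the greedy behaviour the manual describes.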
Merge your two negated character classes and remove the [^A] from the second alternation:

    regex = "AB([^AB]|B|[^B]A)*BA"

This regex fails on the string ABABA, though (the A directly after the opening AB cannot be matched by any of the alternatives). Not sure if that is a problem.
Explanation:

    AB          # Match AB
    (           # Group 1 (could also be non-capturing)
      [^AB]     # Match any character except A or B
    |           # or
      B         # Match B
    |           # or
      [^B]A     # Match any character except B, then A
    )*          # Repeat as needed
    BA          # Match BA
Since the only way to match an A in the alternation is by matching a character other than B before it, we can safely use the simple B as one of the alternatives.
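A quick way to check the regex is to run it against both the string from the question and the failing case. This is just a sketch; the wrapper name first_match is made up for the example.

```shell
# Hypothetical helper: print the first match of the regex, or "no match".
first_match() {
  awk '{
    regex = "AB([^AB]|B|[^B]A)*BA"
    if (match($0, regex)) {
      print substr($0, RSTART, RLENGTH)
    } else {
      print "no match"
    }
  }'
}

printf '%s\n' 'AB 1 BA 2 AB 3 BA' 'ABABA' | first_match
```

The first line should give AB 1 BA, because the group can never consume an A that directly follows a B, so the match has to stop at the first BA. The second line should give no match.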