I\'m pretty sure I\'m missing something obvious here, but I cannot make R to use non-greedy regular expressions:
> library(stringr)
> str_match(\'xxx a
The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine is parsing the string from left to right, and thus all the matches are leftmost. Greediness and laziness only apply to the boundaries on the right: greedy quantifiers get the substrings up to the rightmost boundaries, and the lazy ones will match up to the first occurrence of the subpatterns to follow.
See the examples:
> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"
The first and the third scenarios return the shortest window, the second one is an illustration of the current problem but with a multicharacter input.
Scenario 1. Boundaries are single characters
In case a
and b
are single characters, the shortest window is found by using a negated character class. a[^ab]*b
will easily grab the substring from a
till the next b
with no a
s and b
s in between.
Scenario 2. Boundaries are not single characters
You may use a tempered greedy token in these cases that can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz
pattern matches xxx
, then any 0+ chars other than a linebreak char that is not the starting char of a xxx
or zzz
char sequence (the (?!xxx|zzz)
is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then a zzz
.
These matching scenarios can be easily used with base R regmatches
(using a PCRE regex flavor that supports lookaheads):
> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"
One note: when using a PCRE regex in base R, or the ICU regex in str_extract
/str_match
, the .
does not match linebreak characters, to enable that behavior, you need to add (?s)
at the pattern start (an inline DOTALL modifier).
Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.
Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab
, aaab
, aab
, and ab
are matches to your pattern, but aaaab
being the one that starts the most to the left is the one that is returned.
So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:
str_match('xxx aaaab yyy', "a.*?y")
# [,1]
# [1,] "aaaab y"
Here all of the strings aaaab y
, aaaab yy
, aaaab yyy
matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.
So what can you do to catch that last ab
? Use this:
str_match('xxx aaaab yyy', ".*(a.*b)")
# [,1] [,2]
# [1,] "xxx aaaab" "ab"
How does it work? By adding a greedy pattern .*
in the front, you are now forcing the process to put the last possible a
into the captured group.