I have an issue with lazy quantifiers. Or most likely I misunderstand how I am supposed to use them.
Testing on Regex101
My test string is let's say: 12345678
If you have a string containing numbers followed by a non-numeric, the minimum set of {1,5}? would always be 1 (so it's not necessary to have the range.) I don't think the lazy operator is actually working as we think on the numeric range.
If you make the first \d+ greedy, as below, you'll get the minimum number of digits before the D.
(\d+)(\d{1,5}D)
Matches 9D in the second group
If you make the first set of numbers lazy, then you'll get the maximum number of digits (5)
(\d+?)(\d{1,5}D)
Matches 56789D in the second group
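To check the two patterns above outside Regex101, here is a minimal sketch using Python's re module. The test string is assumed to be the 123456789D123456789 used elsewhere in this thread, and the variable names are only illustrative.

```python
import re

text = "123456789D123456789"  # assumed test string

# Greedy first group: \d+ grabs every digit it can, then backtracks
# just far enough to leave a single digit for \d{1,5}.
greedy = re.search(r"(\d+)(\d{1,5}D)", text)

# Lazy first group: \d+? grows one digit at a time, while the greedy
# \d{1,5} takes up to five digits in front of the D.
lazy = re.search(r"(\d+?)(\d{1,5}D)", text)

print(greedy.group(1), greedy.group(2))  # 12345678 9D
print(lazy.group(1), lazy.group(2))      # 1234 56789D
```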
I think these regex expressions might be more in line with what you need.
Your regex is
.{1,5}?D
which matches
123456789D123456789
    ------
But you said you expected 9D because of the non-greedy quantifier.
Anyway, how about this?
D.{1,5}?
What does it match?
Yes! As you expected, it matches
123456789D123456789
         --
So, WHY?
First, you need to understand that a regex engine normally reads the input string from left to right. In your example with the non-greedy quantifier, once the engine has matched
123456789D123456789
    ------
it will not go on to
123456789D123456789
     -----
123456789D123456789
      ----
...
123456789D123456789
        --
because a match starting further to the left has already been found. The lazy quantifier only makes the engine consume as little text as possible from the position where it started, which is why it is called a "lazy" quantifier.
My regex D.{1,5}? works the same way. It does not go on to
123456789D123456789
         ---
123456789D123456789
         ----
...
123456789D123456789
         ------
but stops at the first match:
123456789D123456789
         --
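Both results can be reproduced with a short Python sketch. Python's re module is used here only as a convenient stand-in for the Regex101 tester. The spans show where each match starts and ends.

```python
import re

text = "123456789D123456789"

# Lazy quantifier in front of D: the match still starts as far left as possible.
lazy_before = re.search(r".{1,5}?D", text)

# Lazy quantifier after D: one character is enough, so the match stops there.
lazy_after = re.search(r"D.{1,5}?", text)

print(lazy_before.group(), lazy_before.span())  # 56789D (4, 10)
print(lazy_after.group(), lazy_after.span())    # D1 (9, 11)
```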
First and foremost, please do not think of greediness and laziness in regex as means of getting the longest/shortest match. "Greedy" and "lazy" terms only pertain to the rightmost character a pattern can match, it does not have any impact on the leftmost one. When you use a lazy quantifier, it will guarantee that the end of your matched substring will be the first found one, not the last found one (that would be returned with a greedy quantifier).
The regex engine analyzes a string from left to right. So, it searches for the first character that meets the pattern and then, once it finds the matching substring, it is returned as a match.
Let's see how it parses the string with .{1,5}?D: 1 is found and D is tested for. No D is found after 1, so the regex engine expands the lazy quantifier, matches 12, and tries to match D. There is a 3 after 2, so the engine expands the lazy dot again, and keeps doing so until the dot has consumed 5 characters. After expanding to the maximum, it has 12345 and the next character is not D. Since the engine has reached the upper bound of the limiting quantifier, the match fails at this location and the next location is tested.
The same scenario happens at every following location before 5. When the engine reaches 5, it tries to match 5D, fails, tries 56D, fails, 567D, fails, 5678D - fails again, and when it tries to match 56789D - Bingo! - the match is found.
This makes it clear that a lazily quantified subpattern at the beginning of a pattern will act "greedily" by default, that is, it will not match the shortest substring.
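The walk-through above can be imitated with a small hand-written simulation. This is only a sketch of the idea, not how any real engine is implemented: trace_lazy_match is a hypothetical helper that tries each start position from left to right and grows the dot run one character at a time, recording every run it tries.

```python
import re

def trace_lazy_match(text, max_len=5, terminator="D"):
    """Imitate the attempts made for `.{1,max_len}?D` (hypothetical helper)."""
    attempts = []
    for start in range(len(text)):               # leftmost start position first
        for length in range(1, max_len + 1):     # lazy: shortest dot run first
            end = start + length
            if end >= len(text):
                break                            # no character left to test for D
            attempts.append(text[start:end])     # the dot run tried so far
            if text[end] == terminator:
                return text[start:end + 1], attempts
    return None, attempts

text = "123456789D123456789"
match, attempts = trace_lazy_match(text)

print(match)          # 56789D
print(attempts[:5])   # ['1', '12', '123', '1234', '12345']
print(len(attempts))  # 25
print(match == re.search(r".{1,5}?D", text).group())  # True
```

The first 20 attempts are the four start positions 1 to 4 exhausting all five lengths. The match only succeeds on the fifth attempt from position 5.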
[Figure: visualization from regex101.com]
Now, here is a fun fact: .{1,5}? at the end of the pattern will always match 1 character (if there is any) because the requirement is to match at least 1, and it is sufficient to return a valid match. So, if you write D.{1,5}?, you will get D1 and D6 in 123456789D12345D678904.
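A quick way to confirm this fun fact is to list all matches with Python's re.findall. The greedy variant is included only for contrast and is not part of the original answer.

```python
import re

text = "123456789D12345D678904"

# Trailing lazy quantifier: one character after each D is enough.
lazy_matches = re.findall(r"D.{1,5}?", text)

# Greedy counterpart, shown for contrast: takes up to five characters.
greedy_matches = re.findall(r"D.{1,5}", text)

print(lazy_matches)    # ['D1', 'D6']
print(greedy_matches)  # ['D12345', 'D67890']
```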
Fun fact 2: In .NET, you can "ask" the regex engine to analyze the string from right to left with the help of the RightToLeft modifier. Then, with .{1,5}?D, you will get 9D.
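Python's re module has no right-to-left option, so the following is only a rough emulation of the effect, not the .NET implementation: it searches the reversed text with a pattern that has been reversed by hand. search_right_to_left is a hypothetical helper, and it assumes the caller supplies the already reversed pattern.

```python
import re

def search_right_to_left(reversed_pattern, text):
    """Emulate a right-to-left scan by searching the reversed text.

    `reversed_pattern` must already be written back to front by hand;
    this helper does not reverse patterns automatically.
    """
    m = re.search(reversed_pattern, text[::-1])
    return m.group()[::-1] if m else None

# .{1,5}?D written back to front is D.{1,5}?
result = search_right_to_left(r"D.{1,5}?", "123456789D123456789")
print(result)  # 9D
```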
Fun fact 3: In .NET, (?<=(.{1,5}?))D will capture 9 into Group 1 if 123456789D is passed as input. This happens because of the way the lookbehind is implemented in .NET regex (.NET reverses the string as well as the pattern inside the lookbehind, then attempts to match that single pattern on the reversed string). And in Java, (?<=(.{1,5}))D (the greedy version) will capture 9 because it tries all the possible fixed-width patterns in the range, from the shortest to the longest, until one succeeds.
And a solution: if you know you need 1 character followed by D, just use
/.D/
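In Python, the same pattern can be tried as follows. This is a minimal sketch; the slashes in /.D/ are only delimiters and are not part of the pattern string.

```python
import re

text = "123456789D123456789"

# Exactly one character followed by D: no quantifier, so no laziness to reason about.
m = re.search(r".D", text)

print(m.group())  # 9D
```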