The code
String s = \"y z a a a b c c z\";
Pattern p = Pattern.compile(\"(a )+(b )+(c *)c\");
Matcher m = p.matcher(s);
while (m.find()) {
System.out.pri
You will need a lazy quantifier.
Please try the following:
Pattern p = Pattern.compile("(a )+(b )+((c )*?)c");
Please also notice, that I grouped "c
" once again, since I think that's what you want. Otherwise you would find arbitrarily many spaces, but not "c
".
I don't know of any regex engines that can give back all valid matches.
But we can apply a bit of logic to generate all candidates string and present it to the regex.
A candidate is constructed by enumerating all possible substring of a given input.
var str = "y z a a a b c c z y z a a a b c c z";
var regex = new Regex("(a )+(b )+(c *)c");
var length = str.Length;
for (int start = 1; start <= length;start++){
for (int groupLength = 1; start + groupLength - 1 <= length ;groupLength++){
var candidate = str.Substring(start-1,groupLength); //.Dump();
//("\"" + candidate + "\"").Dump();
var match = regex.Match(candidate);
if (match.Value == candidate )
{
candidate.Dump();
}
}
}
This gives
a a a b c c
a a b c c
a b c c
which seems the correct answer but contradicts your result :
a a a b c => I state that this is not a match
a a b c c ok
a a b c => I state that this is not a match
a b c c ok
a b c => I state that this is not a match
For example, the regex that you give
(a )+(b )+(c *)c
does not match the first entry in your result
a a a b c
The logic above can generate identical matches if you consider starting position not important. For example if you just repeat the given input another time :
"y z a a a b c c z y z a a a b c c z"
It will give :
a a a b c c
a a b c c
a b c c
a a a b c c
a a b c c
a b c c
If you consider position not important you should do a distinct on this result
The trivial case where the input is the empty string should als be added if considered a potential match.
FYI, this are all the candidates that the regex examines
"y"
"y "
"y z"
"y z "
"y z a"
"y z a "
"y z a a"
"y z a a "
"y z a a a"
"y z a a a "
"y z a a a b"
"y z a a a b "
"y z a a a b c"
"y z a a a b c "
"y z a a a b c c"
"y z a a a b c c "
"y z a a a b c c z"
" "
" z"
" z "
" z a"
" z a "
" z a a"
" z a a "
" z a a a"
" z a a a "
" z a a a b"
" z a a a b "
" z a a a b c"
" z a a a b c "
" z a a a b c c"
" z a a a b c c "
" z a a a b c c z"
"z"
"z "
"z a"
"z a "
"z a a"
"z a a "
"z a a a"
"z a a a "
"z a a a b"
"z a a a b "
"z a a a b c"
"z a a a b c "
"z a a a b c c"
"z a a a b c c "
"z a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a a"
" a a a "
" a a a b"
" a a a b "
" a a a b c"
" a a a b c "
" a a a b c c"
" a a a b c c "
" a a a b c c z"
"a"
"a "
"a a"
"a a "
"a a a"
"a a a "
"a a a b"
"a a a b "
"a a a b c"
"a a a b c "
"a a a b c c"
"a a a b c c "
"a a a b c c z"
" "
" a"
" a "
" a a"
" a a "
" a a b"
" a a b "
" a a b c"
" a a b c "
" a a b c c"
" a a b c c "
" a a b c c z"
"a"
"a "
"a a"
"a a "
"a a b"
"a a b "
"a a b c"
"a a b c "
"a a b c c"
"a a b c c "
"a a b c c z"
" "
" a"
" a "
" a b"
" a b "
" a b c"
" a b c "
" a b c c"
" a b c c "
" a b c c z"
"a"
"a "
"a b"
"a b "
"a b c"
"a b c "
"a b c c"
"a b c c "
"a b c c z"
" "
" b"
" b "
" b c"
" b c "
" b c c"
" b c c "
" b c c z"
"b"
"b "
"b c"
"b c "
"b c c"
"b c c "
"b c c z"
" "
" c"
" c "
" c c"
" c c "
" c c z"
"c"
"c "
"c c"
"c c "
"c c z"
" "
" c"
" c "
" c z"
"c"
"c "
"c z"
" "
" z"
"z"
Also it's good to know how the 2 main types of regexes (NFA and DFA) do their work
from http://msdn.microsoft.com/en-us/library/e347654k.aspx
.NET (and JAVA too I think) are NFA regex engines (as opposed to DFA) and as it processes a particular language element, the engine uses greedy matching; that is, it matches as much of the input string as it possibly can. But it also saves its state after successfully matching a subexpression. If a match eventually fails, the engine can return to a saved state so it can try additional matches. This process of abandoning a successful subexpression match so that later language elements in the regular expression can also match is known as backtracking.
The only way I can think of here would be to generate a list of all possible substrings of your original string and match the regex against each of them, retaining those items where it matched.
Given these very specific constraints (i.e. this is not a general case solution), this will work:
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String s = "y z a a a b c c z";
Pattern p = Pattern.compile("(a )+(b )+(c ?)+");
Set<String> set = recurse(s, p, 0);
}
public static Set<String> recurse(String s, Pattern p, int depth) {
int temp = depth;
while(temp>0) {
System.out.print(" ");
temp--;
}
System.out.println("-> " +s);
Matcher matcher = p.matcher(s);
Set<String> set = new TreeSet<String>();
if(matcher.find()) {
String found = matcher.group().trim();
set.add(found);
set.addAll(recurse(found.substring(1), p, depth+1));
set.addAll(recurse(found.substring(0, found.length()-1), p, depth+1));
}
while(depth>0) {
System.out.print(" ");
depth--;
}
System.out.println("<- " +s);
return set;
}
}
I'm reasonably sure you can adapt it to work on other cases, but the recursion into the matched string means that overlapping matches (like the one pointed out by @ahenderson) won't work.
You can use the reluctant qualifiers such as *?
and +?
. These match as little as possible, in contrast to the standard *
and +
which are greedy, i.e. match as much as possible. Still, this only allows you to find particular "sub-matches", not all of them. Some more control can be achieved using lookahead controlling non-capturing groups, also described in the docs. But in order to really find all sub-matches, you would probably have to do stuff yourself, i.e. build the automaton to which the regex corresponds and navigate it using custom code.