Question
Lately I have been playing around with regex in Java, and I ran into a problem that is (theoretically) easy to solve, but I was wondering whether there is an easier way to do it (yes, yes, I am lazy). The problem is capturing a group multiple times, that is:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("A (IvI(.*?)IvI)*? A");
        Matcher m = p.matcher("A IvI asd IvI IvI qwe IvI A"); // ANY NUMBER of IvI x IvI
        //Matcher m = p.matcher("A A");
        int loi = 0; // last occurrence index
        String storage;
        while (loi >= 0 && m.find(loi)) {
            System.out.println(m.group(1));
            if ((storage = m.group(2)) != null) {
                System.out.println(storage);
            }
            //System.out.println(m.group(1));
            loi = m.end(1);
        }
        m.find(); // match again so that group(1) is available below
        System.out.println("2 opt");
        Pattern p2 = Pattern.compile("IvI(.*?)IvI");
        Matcher m2 = p2.matcher(m.group(1)); // m.group(1) = "IvI asd IvI IvI qwe IvI"
        loi = 0;
        while (loi >= 0 && m2.find(loi)) {
            if ((storage = m2.group(1)) != null) {
                System.out.println(storage);
            }
            loi = m2.end(0);
        }
    }
}
Using ONLY Pattern p, is there any way to get what is inside the IvI's (in the test string that would be "asd" and "qwe"), considering that there could be any number of IvI sections? Something like what I am trying to do in the first while loop: find the first occurrence of the group, then move the index and search for the next group, and so on.
With the code I wrote in that while loop, group 2 comes back as asd IvI IvI qwe, not just asd and then qwe. I suppose that could partly be because of the (.*?) part: it is not supposed to be greedy, yet it still runs up to the qwe, consuming two of the IvI's. I mention this because otherwise I might be able to use the end index of those with the matcher.find(anInt) method, but that does not work either. I don't think there is anything wrong with the regex, since the following code works without consuming the IvI.
public static void main(String[] args) {
    Pattern p = Pattern.compile("(.*?)IvI");
    Matcher m = p.matcher("bla bla blaIvI");
    m.find();
    System.out.println(m.group(1));
}
This prints: bla bla bla
THERE IS A SOLUTION, I KNOW (but I am lazy, remember).
(It is also in the first code, below the "2 opt" message.) The solution is to divide the input into sub-groups and use another regex that processes those sub-groups one at a time...
BTW: I did my homework. This page mentions:
Since a capture group with a quantifier holds on to its number, what value does the engine return when you inspect the group? All engines return the last value captured. For instance, if you match the string A_B_C_D_ with ([A-Z]_)+, when you inspect the match, Group 1 will be D_. With the exception of the .NET engine, all intermediate values are lost. In essence, Group 1 gets overwritten each time its pattern is matched.
But I am still hoping you can give me good news...
Answer 1:
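The overwriting behaviour described in the quoted passage can be checked in Java with a small sketch. The class name LastCaptureDemo and the helper lastCapture are made up for illustration, and the pattern used here is ([A-Z]_)+, with the underscore inside the group so that the whole string A_B_C_D_ matches:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Hypothetical helper: returns what group 1 holds after the whole
    // input has matched, or null if the input does not match at all.
    static String lastCapture(String input) {
        Matcher m = Pattern.compile("([A-Z]_)+").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The group is matched four times (A_, B_, C_, D_),
        // but only the last iteration is kept.
        System.out.println(lastCapture("A_B_C_D_")); // prints D_
    }
}
```

After a single match, java.util.regex only exposes that last value; the earlier iterations A_, B_ and C_ are not retrievable from the Matcher.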
No, unfortunately, as your citation already mentions, the java.util.regex regular expression implementation does not support retrieving any previous values of a repeated capturing group after a single match. The only way to get those, as your code illustrates, is by find()ing multiple matches of the repeated part of your regular expression.
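As a minimal sketch of that find() approach, the loop below matches only the repeated part, IvI(.*?)IvI, and collects group 1 once per match. The class name RepeatedGroupDemo and the helper sections are made up for illustration, and trimming the surrounding spaces is an assumption about the desired output. Note that this sketch does not check that the sections are enclosed by the leading and trailing A:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedGroupDemo {
    // Pattern for ONE repetition only; find() is called once per repetition.
    private static final Pattern SECTION = Pattern.compile("IvI(.*?)IvI");

    // Hypothetical helper: collects the text between each pair of IvI markers.
    static List<String> sections(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = SECTION.matcher(input);
        while (m.find()) {
            // Each successful find() continues after the previous match,
            // so group(1) holds a fresh value every time.
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sections("A IvI asd IvI IvI qwe IvI A")); // prints [asd, qwe]
    }
}
```

Because find() without an argument resumes after the previous match, there is no need to track the end index by hand as in the question's loop.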
I've also been looking at other implementations of regular expressions in Java, for example:
- http://www.brics.dk/automaton/
but I could not find any that support it (only the Microsoft .NET engine does). If I understand correctly, implementations of regular expressions based on state machines cannot easily implement this feature. java.util.regex does not use state machines, though.
If anyone knows of a Java regular expression library that supports this behaviour, please share it, because it would be a powerful feature.
P.S. It took me quite a while to understand your question. The title is good, but the body left me unsure whether I had understood you correctly.
Source: https://stackoverflow.com/questions/26773829/capture-group-multiple-times