Java repetitive pattern matching (2)

北城余情 submitted on 2019-12-25 08:59:37

Question


Consider the following regex:

(([^\|])*\|)*([^\|]*)

This matches repetitive string patterns of the type

("whatever except |" |) {0 to any times} ("whatever except |") {1 time}

So it should match the following String, which has 17 substrings (16 repeated, plus " z" as the last one).

"abcd  | e | fg | hijk | lmnop | |   | qrs |   t| uv| w |||||x   y|  z"

Indeed, RegexPal verifies that the given regex does match the above string.

Now I want to extract each of the substrings (i.e., "abcd  |", " e |", " fg |", etc.), without knowing in advance how many there are or how long each one is.

According to a similarly-titled previous StackOverflow post and the documentation of the Matcher class find() method, I just need to do something like

Pattern pattern = Pattern.compile(regex); // regex is the above regex
Matcher matcher = pattern.matcher(input); // input is the above string

while (matcher.find())
{
   System.out.println(matcher.group(1));
}

However, when I do this, only 2 values are printed: the last repeated substring ("x   y|") and a null; definitely not the 16 substrings I expect.

It would also be nice to check that the string actually matched before running the find() loop, but I am not sure whether to use matches(), groupCount() > 0, or some other condition. I would like to avoid doing the matching work twice, given that find() also performs matching.

So, questions:

  1. How can I get all the 16 repeated substrings?
  2. How can I get the last substring?
  3. How do I check that the string matched?

Answer 1:


If you must use the regular expression...

1) How can I get all the 16 repeated substrings?

See below. When looping over matches with find(), the pattern does not need to match the whole string, only the section you want to extract. (I get 17 matches--is this correct?)

2) How can I get the last substring?

Move the delimiter to the start of the regex and also allow '^' (start of input) in its place, so that the last substring is matched like all the others.

3) How do I check that the string matched?

What would count as a non-match? With this pattern, any string will match.


Here is a solution using regular expressions:

String input = "abcd  | e | fg | hijk | lmnop | |   | qrs |   t| uv| w |||||x   y|  z";
int expectedSize = 17;
List<String> expected = new ArrayList<String>(Arrays.asList("abcd  ", " e ", " fg ", " hijk ", " lmnop ", " ", "   ", " qrs ", "   t", " uv", " w ", "",
    "", "", "", "x   y", "  z"));

List<String> matches = new ArrayList<String>();

// Pattern pattern = Pattern.compile("(?:\\||^)([^\\|]*)");
Pattern pattern = Pattern.compile("(?:_?\\||^)([^\\|]*?)(?=_?\\||$)"); // Edit: allows _| or | as delim

for (Matcher matcher = pattern.matcher(input); matcher.find();)
{
  matches.add(matcher.group(1));
}

for (int idx = 0, len = matches.size(); idx < len; idx++)
{
  System.out.format("[%-2d] \"%s\"%n", idx + 1, matches.get(idx));
}

assertEquals(expectedSize, matches.size()); // compare values; assertSame would compare boxed Integer identity
assertEquals(expected, matches);

Output

[1 ] "abcd  "
[2 ] " e "
[3 ] " fg "
[4 ] " hijk "
[5 ] " lmnop "
[6 ] " "
[7 ] "   "
[8 ] " qrs "
[9 ] "   t"
[10] " uv"
[11] " w "
[12] ""
[13] ""
[14] ""
[15] ""
[16] "x   y"
[17] "  z"



Answer 2:


I'm afraid you're confusing things. Whenever you use repetitions ('*', '+', etc.), you can't get all the instances matched. Using something like ((xxx)*) you can get the whole string matched as group(1) and the last part matched as group(2), nothing else.
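To make the point concrete, here is a minimal sketch that runs the regex from the question against a short input and reads the groups after one full match. The class name, method names and sample input are made up for illustration. The repeated group only remembers its last iteration, so the earlier pieces are not recoverable from the groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // The regex from the question: group 1 is the repeated "text|" part,
    // group 2 a single character inside it, group 3 the trailing piece.
    static final Pattern QUESTION_REGEX = Pattern.compile("(([^\\|])*\\|)*([^\\|]*)");

    // Returns group 1 after matching the whole input, or null if the
    // input did not match or the repeated group never took part.
    static String lastRepeatedCapture(String input) {
        Matcher m = QUESTION_REGEX.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Returns group 3 (the piece after the last pipe), or null on no match.
    static String trailingCapture(String input) {
        Matcher m = QUESTION_REGEX.matcher(input);
        return m.matches() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        String input = "a|bc|def|g"; // hypothetical sample input
        // Only the last iteration "def|" survives; "a|" and "bc|" are gone.
        System.out.println(lastRepeatedCapture(input));
        System.out.println(trailingCapture(input));
    }
}
```

The number of groups is fixed by the pattern, not by how many times a group repeats, which is why a loop over find() with a simpler pattern (or a split) is needed to see every piece.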

Consider using String.split or better Guava's Splitter.
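As a rough sketch of the String.split route (the class and method names are invented for this example), note the limit argument of -1: without it, split drops trailing empty strings, which would change the result for inputs ending in "|".

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // The input string from the question.
    static final String INPUT =
            "abcd  | e | fg | hijk | lmnop | |   | qrs |   t| uv| w |||||x   y|  z";

    // Splits on a literal pipe. "\\|" escapes the pipe, which is otherwise
    // the regex alternation operator; the limit -1 keeps trailing empty pieces.
    static List<String> splitOnPipe(String input) {
        return Arrays.asList(input.split("\\|", -1));
    }

    public static void main(String[] args) {
        List<String> pieces = splitOnPipe(INPUT);
        System.out.println(pieces.size()); // 17 for the question's input
        for (String piece : pieces) {
            System.out.println("\"" + piece + "\"");
        }
    }
}
```

Unlike the substrings listed in the question, the pieces returned here do not include the trailing "|".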


Ad 1. You can't. Use a simple pattern like

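\G([^\|]*)(\||$)

together with find() to get all the matches in sequence. The * sits inside the first group so that group(1) holds the whole substring rather than only its last character. Note the \G, which anchors each match to the end of the previous one.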
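A minimal sketch of such a find() loop might look like this. It assumes the variant of the pattern with the * inside the first group, so that group(1) is the whole piece; the class and method names are invented for the example. One detail to watch: after the piece that ends at $, another find() would succeed once more with an empty match at the end of the input, so the loop stops as soon as the delimiter group is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredFindDemo {
    // Group 1: one piece without pipes. Group 2: the pipe that ends it,
    // or the empty string when the piece ends at the end of the input.
    static final Pattern PIECE = Pattern.compile("\\G([^\\|]*)(\\||$)");

    static List<String> pieces(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = PIECE.matcher(input);
        while (m.find()) {
            result.add(m.group(1));
            if (m.group(2).isEmpty()) {
                // This piece ended at $; stop here, otherwise the next find()
                // would report one extra empty match at the end of the input.
                break;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "abcd  | e | fg | hijk | lmnop | |   | qrs |   t| uv| w |||||x   y|  z";
        List<String> found = pieces(input);
        System.out.println(found.size());
        System.out.println("last: \"" + found.get(found.size() - 1) + "\"");
    }
}
```

With this stopping rule the result should agree with input.split("\\|", -1), including the empty final piece when the input ends in a pipe.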


Ad 2. How can I get the last substring?

It is the first match that find() returns whose delimiter group is empty, i.e. the one that ended at $ rather than at a |.


Ad 3. How do I check that the string matched?

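Remember matcher.end() of the last successful find() and check whether it equals input.length(). (Calling end() after find() has returned false throws an IllegalStateException, so the value has to be saved inside the loop.) With this pattern, however, you don't need to check anything, as it always matches.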
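Since the pattern above accepts any input, the end() check only becomes interesting with a stricter piece pattern. The sketch below uses a hypothetical one that allows only lowercase letters and spaces between pipes, purely to have something that can fail; the class, method and pattern names are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullMatchCheckDemo {
    // Hypothetical stricter pattern: pieces may contain only a-z and spaces.
    static final Pattern STRICT_PIECE = Pattern.compile("\\G([a-z ]*)(\\||$)");

    // Walks the input with \G-anchored matches and reports whether the
    // chain of matches reached the end of the input.
    static boolean consumesWholeInput(Pattern piece, String input) {
        Matcher m = piece.matcher(input);
        int lastEnd = 0;
        while (m.find()) {
            // Save end() here: it cannot be called once find() returns false.
            lastEnd = m.end();
        }
        return lastEnd == input.length();
    }

    public static void main(String[] args) {
        System.out.println(consumesWholeInput(STRICT_PIECE, "ab|c|d"));
        System.out.println(consumesWholeInput(STRICT_PIECE, "ab|c!|d"));
    }
}
```

Because \G forbids skipping ahead, the chain of matches stops at the first piece that does not fit, and the saved end position then falls short of the input length. This way validation and extraction happen in the same pass, so the matching work is not done twice.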



Source: https://stackoverflow.com/questions/7698499/java-repetitive-pattern-matching-2
