Question
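import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name is illustrative; the original post showed only the members.
public class RegexTest {
    public static final String PATTERN = "(?<=(^|,))(([^\",]+)|\"([^\"]*)\")(?=($|,))";

    public static void main(String[] args) {
        String line = ",1234,ABC";
        Matcher matcher = Pattern.compile(PATTERN).matcher(line);
        while (matcher.find()) {
            if (matcher.group(3) != null) {
                // Group 3: an unquoted field
                System.out.println(matcher.group(3));
            } else {
                // Group 4: the contents of a quoted field
                System.out.println(matcher.group(4));
            }
        }
    }
}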
I used the above program to parse the string ",1234,ABC". After parsing, I should get 3 tokens, as follows:
- The empty string, i.e. ""
- 1234
- ABC
It seems to work on Java 1.6, but it is not working on Java 1.5.
Regular expressions have been in Java since Java 1.4, so why am I facing this problem?
Answer 1:
This is a bug in the Java Class Library (Sun's implementation, since taken over by Oracle). It is present at least up to JRE 1.5 Update 18 and is no longer present in JRE 1.6 Update 32 (the two versions I tested).
After some testing, I found bugs in the implementation of both positive look-behind (?<=pattern) and negative look-behind (?<!pattern) (see appendix 1 and 2). It may have something to do with how the engine backtracks when a look-behind (a non-capturing group) contains alternatives of different widths separated by | (appendix 3).
Swapping the order of the alternatives in the look-behind sometimes works (appendix 4), but appendix 2 shows that it does not work every time.
For now, extracting the alternation out of the look-behind seems to be a possible solution. For example, a look-behind with alternation (?<=pat1|pat2|pat3) is converted to (?:(?<=pat1)|(?<=pat2)|(?<=pat3)). Repeat until there is no | inside any look-behind. This produced correct results for the test cases I used below.
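The rewrite described above can be sketched mechanically. The following is a minimal sketch, not part of the original answer: the class and method names are illustrative, and it assumes each alternative is passed in separately and contains no top-level | of its own. It builds both forms and compares the match positions they produce; on a JRE without the bug the two lists should be identical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative names; only the patterns themselves come from the post.
class LookbehindExpansion {

    // Builds the original form: (?<=alt1|alt2|...)
    static String combined(String... alts) {
        StringBuilder sb = new StringBuilder("(?<=");
        for (int i = 0; i < alts.length; i++) {
            if (i > 0) sb.append('|');
            sb.append(alts[i]);
        }
        return sb.append(')').toString();
    }

    // Builds the expanded form: (?:(?<=alt1)|(?<=alt2)|...)
    // Assumption: no alternative contains a top-level '|' itself.
    static String expanded(String... alts) {
        StringBuilder sb = new StringBuilder("(?:");
        for (int i = 0; i < alts.length; i++) {
            if (i > 0) sb.append('|');
            sb.append("(?<=").append(alts[i]).append(')');
        }
        return sb.append(')').toString();
    }

    // Collects the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = ",abc,.";
        String before = combined(",abc", "[,.]");
        String after = expanded(",abc", "[,.]");
        System.out.println(before + " -> " + positions(before, input));
        System.out.println(after + " -> " + positions(after, input));
    }
}
```

On a JRE without the bug, both lines should report positions [1, 4, 5, 6]. On an affected JRE, a difference between the two lines would point at the look-behind alternation.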
So for the regex in question, this is the workaround (assuming the original one is correct):
"(?:^|(?<=,))(?:([^\",]+)|\"([^\"]*)\")(?:$|(?=,))"
In case look-ahead has a similar problem, I also moved the alternation out of the look-ahead and into a non-capturing group, since the result stays the same for your use case. (Testing has not revealed a look-ahead bug so far, but just in case.) Although I am not completely sure, I guess we can trust the engine to work correctly at least for (?<=,) and (?=,). I also took the liberty of reducing the number of capturing groups, so please recount them: the unquoted field is now group 1 and the quoted content is group 2.
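To show how the group numbers change with the workaround regex, here is a minimal sketch of the question's loop adapted to it. The class and method names are illustrative and not from the original post; the only thing taken from the answer is the regex itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative wrapper around the workaround regex from the answer.
class WorkaroundTokenizer {

    // Group 1: an unquoted field. Group 2: the contents of a quoted field.
    static final Pattern FIELD = Pattern.compile(
            "(?:^|(?<=,))(?:([^\",]+)|\"([^\"]*)\")(?:$|(?=,))");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(",1234,ABC"));
        System.out.println(tokenize(",1234,ABC,\"sdfsdf,sdf\",sdfskhkf,"));
    }
}
```

On a current JRE this should print [1234, ABC] and [1234, ABC, sdfsdf,sdf, sdfskhkf]. Note that empty unquoted fields are not returned by this regex or by the original one, because [^\",]+ requires at least one character. If the leading empty token from the question is needed, the pattern itself would have to change, which is outside what the answer covers.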
Appendix
1. Tested with the input string ",abc,1234" and the regexes "(?<=^|[,.])" and "(?<!^|[,.])". The results differed between JRE 1.5u18 and JRE 1.6u32. For the positive look-behind "(?<=^|[,.])", the match at position 1 is missing from the output of JRE 1.5u18 compared with that of JRE 1.6u32. Instead, on JRE 1.5u18, position 1 appears in the result for the negative look-behind "(?<!^|[,.])", while the output of JRE 1.6u32 does not contain it. This complementary behavior is not much of a surprise, as positive and negative look-behind are exact opposites of each other.
2. Another test, with the input string ",abc,." and the regex "(?<=,abc|[,.])". The match at position 1 does not appear in the results for JRE 1.5u18, compared with JRE 1.6u32. If we swap the alternation around, "(?<=[,.]|,abc)", the match at position 4 is missing from the results of JRE 1.5u18 instead.
3. The problem may not be limited to alternatives of different widths, but that is the case I tested.
4. I can make the regex in the question work on the input ",1234,ABC,\"sdfsdf,sdf\",sdfskhkf," by swapping ^ and , in the alternation, i.e. changing (?<=(^|,)) to (?<=(,|^)).
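The first appendix test can be reproduced with a small harness along these lines. This is a sketch with illustrative names, not the code the answer was tested with. It lists the match positions for the positive and negative look-behind and checks that, taken together, they cover every index exactly once, which is what a correct engine should give.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative harness for the ",abc,1234" test from the appendix.
class LookbehindComplement {

    // Collects the start index of every (zero-width) match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    // True if every index 0..length appears in exactly one of the two lists.
    static boolean partitions(List<Integer> a, List<Integer> b, int length) {
        for (int i = 0; i <= length; i++) {
            if (a.contains(i) == b.contains(i)) {
                return false; // index is in both lists or in neither
            }
        }
        return a.size() + b.size() == length + 1;
    }

    public static void main(String[] args) {
        String input = ",abc,1234";
        List<Integer> positive = positions("(?<=^|[,.])", input);
        List<Integer> negative = positions("(?<!^|[,.])", input);
        System.out.println("positive: " + positive);
        System.out.println("negative: " + negative);
        System.out.println("complementary: " + partitions(positive, negative, input.length()));
    }
}
```

On a JRE without the bug, the positive list should be [0, 1, 5] and the negative list [2, 3, 4, 6, 7, 8, 9]. According to the appendix, JRE 1.5u18 reports position 1 in the negative list instead of the positive one.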
Answer 2:
String line = ",1234,ABC";
String[] arr = line.split(",");
System.out.println("arr.length = " + arr.length);
for (String s : arr) {
    System.out.println("s = \"" + s + "\"");
}
Output is:
arr.length = 3
s = ""
s = "1234"
s = "ABC"
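Two caveats are worth checking before relying on split for this kind of input. The sketch below (illustrative class and method names, not part of the original answer) shows that split(",") drops trailing empty strings unless a negative limit is passed, and that it knows nothing about quoted fields, so a comma inside quotes still splits the field.

```java
import java.util.Arrays;

// Illustrative demo of String.split behavior on comma-separated input.
class SplitCaveats {

    // A negative limit keeps trailing empty strings in the result.
    static String[] splitKeepingTrailing(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        // Trailing empty field is dropped by default.
        System.out.println(Arrays.toString(",1234,ABC,".split(",")));
        // Trailing empty field is kept with limit -1.
        System.out.println(Arrays.toString(splitKeepingTrailing(",1234,ABC,")));
        // The quoted field is cut at the embedded comma.
        System.out.println(Arrays.toString("a,\"b,c\",d".split(",")));
    }
}
```

The first call yields 3 elements, the second 4, and the third 4 (with the quoted field split in two). For input like the longer test string in Answer 1, which contains a quoted comma, plain split would therefore not give the same tokens as the regex.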
Source: https://stackoverflow.com/questions/14414407/regex-not-working-in-java-1-5