Regex not working in java 1.5

Submitted by ぐ巨炮叔叔 on 2020-01-03 03:40:52

Question


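import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is illustrative; the original post showed only the members.
public class CsvRegex {
    public static final String PATTERN = "(?<=(^|,))(([^\",]+)|\"([^\"]*)\")(?=($|,))";

    public static void main(String[] args) {
        String line = ",1234,ABC";
        Matcher matcher = Pattern.compile(PATTERN).matcher(line);
        while (matcher.find()) {
            // Group 3 is an unquoted field; group 4 is the contents of a quoted field.
            if (matcher.group(3) != null) {
                System.out.println(matcher.group(3));
            } else {
                System.out.println(matcher.group(4));
            }
        }
    }
}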

I used the above program to parse the string ",1234,ABC". After parsing I should get 3 tokens as follows:

  1. Empty string, i.e. ""
  2. 1234
  3. ABC

It seems to work on Java 1.6, but it's not working on Java 1.5.

Regular expressions have been in Java since 1.4, so why am I facing this problem?


Answer 1:


This is a bug in the Java Class Library (Sun's implementation, later taken over by Oracle). It is present at least up to JRE 1.5 Update 18 and is fixed by JRE 1.6 Update 32 at the latest (these are the two versions I tested on).

After some testing, I found bugs in the implementation of positive look-behind (?<=pattern) and also negative look-behind (?<!pattern) (see Appendix 1 and 2). They may have something to do with how the engine backtracks when a look-behind contains alternatives of different widths separated by | (see Appendix 3).

Swapping the order of the alternatives in the look-behind sometimes works (see Appendix 4), but Appendix 2 shows that it does not work all the time.

For now, moving the alternation out of the look-behind seems to be a possible solution. For example, a look-behind with alternation (?<=pat1|pat2|pat3) is converted to (?:(?<=pat1)|(?<=pat2)|(?<=pat3)). Repeat until there is no | left inside any look-behind. This seems to produce correct results for the test cases I used below.
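The rewrite rule can be checked with a small comparison, sketched here under the assumption that it runs on a JRE without the bug. The class and method names are made up for illustration; the input and the first pattern come from Appendix 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LookBehindRewriteDemo {
    // Collects the start offset of every match of regex in input.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = ",abc,.";
        // Alternation inside the look-behind (the form that misbehaves on JRE 1.5u18).
        String original = "(?<=,abc|[,.])";
        // Alternation moved outside: one look-behind per alternative.
        String rewritten = "(?:(?<=,abc)|(?<=[,.]))";
        System.out.println(matchPositions(original, input));
        System.out.println(matchPositions(rewritten, input));
    }
}
```

On an unaffected JRE both lines should list the same positions (1, 4, 5 and 6). If the first line is missing an entry while the second is complete, the runtime is likely affected by the bug described here.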

So for the regex in question, this is the workaround (assuming the original one is correct):

"(?:^|(?<=,))(?:([^\",]+)|\"([^\"]*)\")(?:$|(?=,))"

In case look-ahead has a similar problem, I also moved the alternation out of it and into a non-capturing group, since the result stays the same for your use case. (Testing has not revealed a bug there so far, but just in case.) Although I am not completely sure, I guess we can trust the engine to work correctly at least for (?<=,) and (?=,). I also took the liberty of reducing the number of capturing groups, so please recount them: the unquoted field is now group 1 and the contents of a quoted field are group 2.
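As a rough sketch of how the workaround pattern might be used with the renumbered groups, the class below collects the fields into a list. The class and method names are illustrative only; the pattern is the one given above, and the longer test line is the one from Appendix 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, not part of the original answer.
class CsvWorkaroundDemo {
    // Workaround pattern: no alternation inside look-behind or look-ahead.
    static final String PATTERN =
            "(?:^|(?<=,))(?:([^\",]+)|\"([^\"]*)\")(?:$|(?=,))";

    static List<String> tokens(String line) {
        List<String> result = new ArrayList<String>();
        Matcher matcher = Pattern.compile(PATTERN).matcher(line);
        while (matcher.find()) {
            // Group 1: unquoted field. Group 2: contents of a quoted field.
            if (matcher.group(1) != null) {
                result.add(matcher.group(1));
            } else {
                result.add(matcher.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens(",1234,ABC"));
        System.out.println(tokens(",1234,ABC,\"sdfsdf,sdf\",sdfskhkf,"));
    }
}
```

One thing to be aware of: as far as I can tell, both alternatives need at least one character (or a pair of quotes) to match, so empty fields such as the leading one in ",1234,ABC" are skipped rather than returned as "". This applies to the original pattern as well.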

Appendix

  1. Tested with the input string ",abc,1234" and the regexes "(?<=^|[,.])" and "(?<!^|[,.])". The results differed between JRE 1.5u18 and JRE 1.6u32. For the positive look-behind "(?<=^|[,.])", the match at position 1 is missing from the output of JRE 1.5u18, compared to that of JRE 1.6u32. Conversely, on JRE 1.5u18 position 1 appears in the result for the negative look-behind "(?<!^|[,.])", while the output of JRE 1.6u32 doesn't contain it.

    This complementary behavior is not much of a surprise, as positive and negative look-behind are exact opposites of each other.

  2. Another test with the input string ",abc,." and the regex "(?<=,abc|[,.])". The match at position 1 does not appear in the list of results for JRE 1.5u18, compared to JRE 1.6u32.

    If we swap the alternatives around, "(?<=[,.]|,abc)", the match at position 4 is missing from the result of JRE 1.5u18, compared to JRE 1.6u32.

  3. The bug may not be limited to alternatives of different widths, but that is the only case I have tested.

  4. I can make the regex in the question work on the input ",1234,ABC,\"sdfsdf,sdf\",sdfskhkf," by swapping ^ and , in the alternation, i.e. changing (?<=(^|,)) to (?<=(,|^)).
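To repeat the Appendix 1 comparison on another runtime, something like the following could be used. It is only a sketch: the class name and the startOffsets helper are made up, and the check assumes that a correct engine reports every position from 0 to the input length in exactly one of the two results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness, not part of the original answer.
class LookBehindComplementCheck {
    // Start offsets of all (zero-width) matches of regex in input.
    static List<Integer> startOffsets(String regex, String input) {
        List<Integer> offsets = new ArrayList<Integer>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String input = ",abc,1234";
        List<Integer> positive = startOffsets("(?<=^|[,.])", input);
        List<Integer> negative = startOffsets("(?<!^|[,.])", input);
        System.out.println("positive look-behind: " + positive);
        System.out.println("negative look-behind: " + negative);

        // Every position 0..length should show up in exactly one list.
        for (int i = 0; i <= input.length(); i++) {
            boolean inPositive = positive.contains(Integer.valueOf(i));
            boolean inNegative = negative.contains(Integer.valueOf(i));
            if (inPositive == inNegative) {
                System.out.println("inconsistent at position " + i);
            }
        }
    }
}
```

On an unaffected JRE the positive list should contain positions 0, 1 and 5, the negative list the remaining positions, and no "inconsistent" line should be printed.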




Answer 2:


String line = ",1234,ABC";
String[] arr = line.split(",");
System.out.println("arr.length = " + arr.length);
for (String s : arr) {
    System.out.println("s = \"" + s + "\"");
}

Output is:

arr.length = 3
s = ""
s = "1234"
s = "ABC"


Source: https://stackoverflow.com/questions/14414407/regex-not-working-in-java-1-5
