Regular expression to split a String on spaces and matching quotes in Java

Posted by 别来无恙 on 2019-12-02 14:33:16

Question


I have a String which I need to split on spaces, while keeping text inside a matching pair of quotes together.

If the input is

string = "It is fun \"to write\" regular\"expression"

then after the split I want the result to be:

It

is

fun

"to write"

regular

"expression

The closest regular expression I have come up with so far is:

STRING_SPLIT_REGEXP = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'"

Thanks in advance for answers.


Answer 1:


It seems that you took the regex from this answer, but as you can see, that answer does not use split; it uses the find method of the Matcher class. It also handles ', although your input shows no sign of single quotes.
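To see why the distinction matters, here is a minimal sketch of what happens when a token-describing pattern is handed to split. The class and method names are made up for illustration, and the pattern is the question's regex reduced to the double-quote case. Because split treats every match as a delimiter, the words themselves are thrown away and only the whitespace between them is returned.

```java
import java.util.Arrays;

final class SplitVersusFind {
    // Pattern that describes tokens (words or quoted text), not separators.
    static final String TOKEN_REGEX = "[^\\s\"]+|\"[^\"]*\"";

    // split() treats each match of the pattern as a delimiter and drops it.
    static String[] splitOnTokens(String input) {
        return input.split(TOKEN_REGEX);
    }

    public static void main(String[] args) {
        // The words are consumed as delimiters; only the gaps between them remain.
        System.out.println(Arrays.toString(splitOnTokens("It is fun")));
    }
}
```

This is why the pattern has to be driven through Matcher.find, which returns the matches themselves.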

So you can simplify this regex by removing the parts that handle ', which leaves

[^\\s\"]+|\"([^\"]*)\"

Also, since you want to keep the " characters as part of the token, you don't need to put the text between the quotes in a separate group, so drop the parentheses in the \"([^\"]*)\" part:

[^\\s\"]+|\"[^\"]*\"

Now all that is left is to handle the case where there is no closing " and the end of the string is reached instead. So change the regex to

[^\\s\"]+|\"[^\"]*(\"|$)
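The effect of that last change can be checked with a small sketch that runs both versions of the pattern over the question's input. The class name and the tokens helper are hypothetical, introduced only for this comparison. Without the end-of-string alternative the unmatched quote is skipped and the last token comes back as expression; with it, the last token is "expression, as the question asks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnclosedQuoteDemo {
    // Collects every match of the given token pattern, in order.
    static List<String> tokens(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regular\"expression";
        // Requires a closing quote: the lone " is skipped, last token is expression
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*\"", data));
        // Accepts end of input in place of a closing quote: last token is "expression
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*(\"|$)", data));
    }
}
```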

After this you can just use a Matcher to find all the tokens and store them somewhere, let's say in a List.

Example:

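import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "It is fun \"to write\" regular\"expression";
List<String> matchList = new ArrayList<String>();
// Token = run of non-space, non-quote characters, or a quoted part closed by " or by end of input
Pattern regex = Pattern.compile("[^\\s\"]+|\"[^\"]*(\"|$)");
Matcher regexMatcher = regex.matcher(data);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.group());
    matchList.add(regexMatcher.group());
}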

Output:

It
is
fun
"to write"
regular
"expression

A more complex expression that handles this data with split can look like

String data = "It is fun \"to write\" regular \"expression";
for(String s : data.split("(?<!\\G)(?<=\\G[^\"]*(\"[^\"]{0,100000}\")?[^\"]*)((?<=\"(?!\\s))|\\s+|(?=\"))"))
    System.out.println(s);

but this approach is far more complicated than writing your own parser.


Such a parser could look like

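// Needs: import java.util.ArrayList; import java.util.List;
public static List<String> parse(String data) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();   // characters of the token being built
    boolean insideQuote = false;              // true between an opening " and its closing "
    char previous = '\0';

    for (char ch : data.toCharArray()) {
        if (ch == ' ' && !insideQuote) {
            // A space outside quotes ends the current token (if there is one).
            if (sb.length() > 0 && previous != '"')
                addTokenAndResetBuilder(sb, tokens);
        } else if (ch == '"') {
            if (insideQuote) {
                // Closing quote: keep it and finish the quoted token.
                sb.append(ch);
                addTokenAndResetBuilder(sb, tokens);
            } else {
                // Opening quote: finish any token before it, then start a quoted one.
                addTokenAndResetBuilder(sb, tokens);
                sb.append(ch);
            }
            insideQuote = !insideQuote;
        } else {
            sb.append(ch);
        }
        previous = ch;
    }
    // Flush whatever is left, e.g. a quoted token that was never closed.
    addTokenAndResetBuilder(sb, tokens);

    return tokens;
}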

private static void addTokenAndResetBuilder(StringBuilder sb, List<String> list) {
    if (sb.length() > 0) {
        list.add(sb.toString());
        sb.delete(0, sb.length());
    }
}

Usage

String data = "It is fun \"to write\" regular\"expression\"xxx\"yyy";
for (String s : parse(data))
    System.out.println(s);



Answer 2:


You are running into a fundamental limitation of regular expressions here. In general, they can't track recursion, nesting depth, and the like.

So in your string:

"It is fun \"to write\" regular\"expression"

Both the space between to and write and the space between \" and regular sit between a pair of quote marks. A regex is not able to "count" the quotes in a flexible way and take action based on that count.

You will need to write your own string parser for this (or use an existing one); regex can't handle it.




Answer 3:


The trick is to use a flexible lookahead to assert that:

  • if there's an even number of quotes in the input, there should be an even number following the space, because an odd number means the space is within quotes
  • if there's an odd number of quotes in the input, there should be an odd number following the space, because an even number means the space is within quotes

I got it into one line, but it's a whopper:

String[] parts = str.split("(\\s+|(?<!\\s)(?=\"))(?=(([^\"]*\"){2})*[^\"]*"
            + (str.matches("(([^\"]*\"){2})*[^\"]*") ? "" : "\"[^\"]*") + "$)");

This correctly splits the example string with or without the trailing quote (whether or not the trailing term includes a space).
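The str.matches(...) call inside that one-liner is the piece that decides which of the two cases applies: it succeeds only when the whole input contains an even number of quotes. A minimal sketch of that check on its own, with a hypothetical class and method name:

```java
final class QuoteParity {
    // Same pattern as in the matches() call above:
    // zero or more pairs of quotes, followed by a tail that contains no quote.
    static final String EVEN_QUOTES = "(([^\"]*\"){2})*[^\"]*";

    // True when the input contains an even number (including zero) of " characters.
    static boolean hasEvenQuoteCount(String s) {
        return s.matches(EVEN_QUOTES);
    }

    public static void main(String[] args) {
        System.out.println(hasEvenQuoteCount("fun \"to write\" regular")); // two quotes
        System.out.println(hasEvenQuoteCount("regular\"expression"));      // one quote
    }
}
```

When the count is odd, the one-liner appends \"[^\"]* to the lookahead so that the unmatched trailing quote is accounted for.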



Source: https://stackoverflow.com/questions/22416318/regular-expression-to-split-string-based-on-space-and-matching-quotes-in-java
