Regular expression to split a String on spaces and matching quotes in Java

一整个雨季 2021-01-27 18:54

I have a String which I need to split on spaces, while keeping text between exactly matching quotes together.

If the string = \"It is fun \\\"to write\\\" regular\\\"expressi


        
3 Answers
  • 2021-01-27 19:16

    You are running into a fundamental limitation of regular expressions here. In general they can't detect recursion, depth, etc.

    So in your string:

    "It is fun \"to write\" regular\"expression"
    

    Both the space between to and write and the space between \" and regular are inside quote marks. A regex cannot "count" the number of quotes in a flexible way and act on that count.

    You will need to write your own string parser for this (or use an existing one). Regex can't handle it though.

  • 2021-01-27 19:31

    The trick is to use a flexible lookahead to assert that:

    • if there's an even number of quotes in the input, there should be an even number following the space, because an odd number means the space is within quotes
    • if there's an odd number of quotes in the input, there should be an odd number following the space, because an even number means the space is within quotes

    I got it into one line, but it's a whopper:

    String[] parts = str.split("(\\s+|(?<!\\s)(?=\"))(?=(([^\"]*\"){2})*[^\"]*"
                + (str.matches("(([^\"]*\"){2})*[^\"]*") ? "" : "\"[^\"]*") + "$)");
    

    This correctly splits the example string with or without the trailing quote (whether or not the trailing term includes a space).
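
    The parity check inside the lookahead is easier to see in a simpler form. The following is a minimal sketch, assuming the quotes in the input are balanced (so the str.matches(...) branch above is not needed) and that quoted sections are separated from their neighbours by whitespace. The class and method names are illustrative only and do not come from this answer.

```java
import java.util.Arrays;
import java.util.List;

class EvenQuoteSplit {
    // Split on whitespace only when an even number of quotes follows it,
    // i.e. when the whitespace lies outside any quoted section.
    // Assumption: quotes are balanced; this is a hypothetical helper.
    static List<String> splitOutsideQuotes(String str) {
        return Arrays.asList(str.split("\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"));
    }

    public static void main(String[] args) {
        for (String part : splitOutsideQuotes("It is fun \"to write\" regular"))
            System.out.println(part);
    }
}
```

    The space between to and write is followed by a single quote, so the lookahead fails there and "to write" stays in one piece.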

  • 2021-01-27 19:34

    It seems that you just used the regex from this answer, but as you can see, it relies on the find method of the Matcher class rather than on split. That answer also handles ' characters, which your input does not contain.

    So you can simplify the regex by removing the parts that handle ', which gives

    [^\\s\"]+|\"([^\"]*)\"
    

    Also, since you want to keep the " characters as part of the token, you don't need to capture the text between them in a separate group, so drop the parentheses in the \"([^\"]*)\" part:

    [^\\s\"]+|\"[^\"]*\"
    

    Now all you need to add is the case where there is no closing " and the end of the string is reached instead. So change the regex to

    [^\\s\"]+|\"[^\"]*(\"|$)
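
    To see what the (\"|$) alternative changes, the sketch below runs the previous pattern and this one over the same input with Matcher.find and collects the tokens. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizeCompare {
    // Collect every match of the given regex, in order of appearance.
    static List<String> tokens(String regex, String data) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(data);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regular\"expression";
        // Requires a closing quote, so the unterminated quote is skipped.
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*\"", data));
        // Accepts end of input in place of the closing quote.
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*(\"|$)", data));
    }
}
```

    With the first pattern the unterminated quote is silently dropped and expression comes back as a bare token. With the second pattern the last token is "expression, with its opening quote kept.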
    

    After this you can use a Matcher to find all the tokens and store them somewhere, say in a List.

    Example:

    // Requires java.util.ArrayList, java.util.List,
    // java.util.regex.Matcher and java.util.regex.Pattern.
    String data = "It is fun \"to write\" regular\"expression";
    List<String> matchList = new ArrayList<String>();
    Pattern regex = Pattern.compile("[^\\s\"]+|\"[^\"]*(\"|$)");
    Matcher regexMatcher = regex.matcher(data);
    // Each find() call advances to the next token.
    while (regexMatcher.find()) {
        System.out.println(regexMatcher.group());
        matchList.add(regexMatcher.group());
    }
    

    Output:

    It
    is
    fun
    "to write"
    regular
    "expression
    

    A more complex expression that handles this data with split can look like

    String data = "It is fun \"to write\" regular \"expression";
    for(String s : data.split("(?<!\\G)(?<=\\G[^\"]*(\"[^\"]{0,100000}\")?[^\"]*)((?<=\"(?!\\s))|\\s+|(?=\"))"))
        System.out.println(s);
    

    but this approach is far more complicated than writing your own parser.

    Such a parser could look like

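    import java.util.ArrayList;
    import java.util.List;

    // Splits on spaces outside quotes; quoted sections keep their quote characters.
    public static List<String> parse(String data) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        boolean insideQuote = false;
        char previous = '\0';

        for (char ch : data.toCharArray()) {
            if (ch == ' ' && !insideQuote) {
                // A space outside quotes ends the current token, if there is one.
                if (sb.length() > 0 && previous != '"')
                    addTokenAndResetBuilder(sb, tokens);
            } else if (ch == '"') {
                if (insideQuote) {
                    // Closing quote: keep it and emit the quoted token.
                    sb.append(ch);
                    addTokenAndResetBuilder(sb, tokens);
                } else {
                    // Opening quote: emit any pending token, then start a new one.
                    addTokenAndResetBuilder(sb, tokens);
                    sb.append(ch);
                }
                insideQuote = !insideQuote;
            } else {
                sb.append(ch);
            }
            previous = ch;
        }
        // Emit whatever is left, including an unterminated quoted section.
        addTokenAndResetBuilder(sb, tokens);

        return tokens;
    }

    // Adds the builder's content to the list (if non-empty) and clears the builder.
    private static void addTokenAndResetBuilder(StringBuilder sb, List<String> list) {
        if (sb.length() > 0) {
            list.add(sb.toString());
            sb.delete(0, sb.length());
        }
    }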
    

    Usage

    String data = "It is fun \"to write\" regular\"expression\"xxx\"yyy";
    for (String s : parse(data))
        System.out.println(s);
    