How to split a comma-separated String while ignoring escaped commas?

我寻月下人不归 2020-11-29 03:43

I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function which takes an additional parameter: the escape char.

so calling my:

4 Answers
  • 2020-11-29 04:06

    As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\,test\\,tes" , "test"]
    

    As drvdijk said, (?<!\\), will misinterpret escaped backslashes.

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\,test\\,test" , "test"]
      -(unescape commas)->
    ["test\\\\,test\\,test,test" , "test"]
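
To make the two behaviours above concrete, here is a minimal sketch that runs both regular expressions over the same sample string. The class and method names (NaiveSplitDemo, splitConsumingPreviousChar, splitWithLookbehind) are made up for this illustration and do not come from the answers.

```java
import java.util.Arrays;

final class NaiveSplitDemo {
    // Raw text of the sample: test\\\,test\\,test\,test,test
    static final String INPUT = "test\\\\\\,test\\\\,test\\,test,test";

    // Regex [^\\], : the non-backslash character before the comma
    // belongs to the delimiter and is therefore removed from the element.
    static String[] splitConsumingPreviousChar(String s) {
        return s.split("[^\\\\],");
    }

    // Regex (?<!\\), : zero-width negative lookbehind, so nothing is eaten,
    // but a comma preceded by an escaped backslash is still not split on.
    static String[] splitWithLookbehind(String s) {
        return s.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitConsumingPreviousChar(INPUT)));
        System.out.println(Arrays.toString(splitWithLookbehind(INPUT)));
    }
}
```

Both variants split only at the last comma; the first one additionally loses the trailing t of the first element.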
    

    I would expect being able to escape backslashes as well...

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\" , "test\\,test" , "test"]
      -(unescape commas and backslashes)->
    ["test\\,test\\" , "test,test" , "test"]
    

    drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists whose elements end with up to 100 escaped backslashes. That is more than enough in practice, but why have a limit at all? Is there a more efficient way (isn't lookbehind greedy)? And what about invalid strings?
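
A small sketch of how the bounded lookbehind quoted above could be used, taking the regex as-is. As far as I know, the {0,100} bound exists because Java has traditionally rejected lookbehinds that have no obvious maximum length, so an unbounded * was not an option. The class name BoundedLookbehindDemo and its helper methods are hypothetical.

```java
import java.util.Arrays;

final class BoundedLookbehindDemo {
    // Regex (unescaped): (?<=(?<!\\)(\\\\){0,100}),
    // i.e. a comma preceded by an even number (0 to 200) of backslashes.
    static final String DELIMITER = "(?<=(?<!\\\\)(\\\\\\\\){0,100}),";

    // Split on unescaped commas; -1 keeps trailing empty elements.
    static String[] split(String list) {
        return list.split(DELIMITER, -1);
    }

    // Turn \, into , and \\ into \ inside a single element.
    static String unescape(String element) {
        return element.replaceAll("\\\\([\\\\,])", "$1");
    }

    public static void main(String[] args) {
        // Raw text: test\\\,test\\,test\,test,test
        String input = "test\\\\\\,test\\\\,test\\,test,test";
        for (String element : split(input)) {
            System.out.println(unescape(element));
        }
        System.out.println(Arrays.toString(split(input)));
    }
}
```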

    I searched for a while for a generic solution, then I wrote the thing myself... The idea is to split following a pattern that matches the list elements (instead of matching the delimiter).

    My answer does not take the escape character as a parameter.

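    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static List<String> commaDelimitedListStringToStringList(String list) {
        // Check the validity of the list
        // ex: "te\\st" is not valid, backslash should be escaped
        if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
            // Could also raise an exception
            return null;
        }
        // Matcher for the list elements
        Matcher matcher = Pattern
                .compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
                .matcher(list);
        List<String> result = new ArrayList<>();
        // Index at which the next element must start (one past the previous delimiter)
        int expectedStart = 0;
        while (matcher.find()) {
            // An element ending in an escaped comma satisfies the lookbehind at its
            // own end, which yields a spurious empty match there: skip it
            if (matcher.start() != expectedStart) {
                continue;
            }
            // Unescape the list element
            result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
            expectedStart = matcher.end() + 1;
        }
        return result;
    }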
    

    Description for the pattern (unescaped):

    (?<=(^|,)) lookbehind: the element is preceded by the start of the string or a ,

    ([^\\,]|\\,|\\\\)* the element itself, composed of \,, \\ or characters which are neither \ nor ,

    (?=(,|$)) lookahead: the element is followed by the end of the string or a ,

    The pattern may be simplified.

    Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
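
As an illustration of the "specific parser" idea, here is a hedged sketch of a single-pass scanner. It is not the code that was timed above. The names EscapedListParser and split are hypothetical, and two choices are my own assumptions: the delimiter and escape character are passed as parameters, and invalid input raises an IllegalArgumentException instead of returning null.

```java
import java.util.ArrayList;
import java.util.List;

final class EscapedListParser {
    /**
     * Splits list on unescaped delimiters and unescapes each element.
     * Only the delimiter and the escape character itself may be escaped.
     */
    static List<String> split(String list, char delimiter, char escape) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == escape) {
                // The escape character must be followed by something escapable
                if (i + 1 >= list.length()) {
                    throw new IllegalArgumentException("Dangling escape at index " + i);
                }
                char next = list.charAt(++i);
                if (next != delimiter && next != escape) {
                    throw new IllegalArgumentException("Invalid escape at index " + (i - 1));
                }
                current.append(next);
            } else if (c == delimiter) {
                // Unescaped delimiter: close the current element
                result.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last element has no trailing delimiter
        result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        // Raw text: test\\\,test\\,test\,test,test
        for (String element : split("test\\\\\\,test\\\\,test\\,test,test", ',', '\\')) {
            System.out.println(element);
        }
    }
}
```

Because every character is visited exactly once, there is no separate validation, matching and unescaping pass, and an element that ends in an escaped comma needs no special handling.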

    Also, why have an escape character at all if only one character is special? The comma could simply be doubled...

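    public static List<String> commaDelimitedListStringToStringList2(String list) {
        // A doubled comma (,,) stands for a literal comma
        if (!list.matches("^(([^,]|,,)*(,|$))+")) {
            return null;
        }
        Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
                        .matcher(list);
        List<String> result = new ArrayList<>();
        // Index at which the next element must start (one past the previous delimiter)
        int expectedStart = 0;
        while (matcher.find()) {
            // Skip the spurious empty match found right after an element ending in ,,
            if (matcher.start() != expectedStart) {
                continue;
            }
            result.add(matcher.group().replace(",,", ","));
            expectedStart = matcher.end() + 1;
        }
        return result;
    }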
    
  • 2020-11-29 04:08

    For future reference, here is the complete method I ended up with:

    import java.util.regex.Pattern;

    public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
        // quote the escape char so that regex metacharacters (\, ^, $, ., ...)
        // are taken literally inside the regular expression
        String quotedEscapeChar = Pattern.quote(escapeChar);

        // see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
        // split on commas not preceded by the escape char; -1 keeps trailing empty elements
        String[] temp = str.split("(?<!" + quotedEscapeChar + "),", -1);

        // remove the escapeChar for the end result (literal, non-regex replacement)
        String[] result = new String[temp.length];
        for (int i = 0; i < temp.length; i++) {
            result[i] = temp[i].replace(escapeChar + ",", ",");
        }

        return result;
    }
    
  • 2020-11-29 04:09

    Try:

    String array[] = str.split("(?<!\\\\),");
    

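    Basically this says: split on a comma, except where that comma is preceded by a backslash. The four backslashes in the Java string literal become the two-character regex \\, which matches a single literal backslash. The (?<!...) construct is called a zero-width negative lookbehind assertion.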
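
One detail worth knowing when using this one-liner: String.split(regex) silently drops trailing empty strings, whereas split(regex, -1) keeps them. The sketch below contrasts the two; the class name SplitLimitDemo and its method names are invented for the example.

```java
import java.util.Arrays;

final class SplitLimitDemo {
    // Regex (unescaped): (?<!\\),
    static final String DELIMITER = "(?<!\\\\),";

    // Default limit of 0: trailing empty elements are removed
    static String[] splitDroppingTrailingEmpties(String s) {
        return s.split(DELIMITER);
    }

    // Negative limit: every element is kept, including trailing empty ones
    static String[] splitKeepingTrailingEmpties(String s) {
        return s.split(DELIMITER, -1);
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        System.out.println(Arrays.toString(splitDroppingTrailingEmpties(input)));
        System.out.println(Arrays.toString(splitKeepingTrailingEmpties(input)));
    }
}
```

For the input a,b,, the first call returns two elements and the second returns four, the last two being empty strings.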

  • 2020-11-29 04:11

    The regular expression

    [^\\],
    

    means "match a character which is not a backslash, followed by a comma". This is why strings such as t, match: t is a character which is not a backslash.

    I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like

    (?<!\\),
    

    (BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)
