Regular expression troubles, escaped quotes

Question

Basically, I'm being passed a string and I need to tokenise it in much the same way a *nix shell tokenises command-line options.

Say I have the following string:

"Hello\" World" "Hello Universe" Hi

How could I turn it into a 3-element list?

Hello" World
Hello Universe
Hi

The following is my first attempt, but it has a couple of problems:

It leaves the quote characters
It doesn't catch the escaped quote

Code:

public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token*/
        "|'[^']*'" + /*single quoted token*/
        "|[A-Za-z']+" /*everything else*/
    );

    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);

    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}

So the incorrect output of the above code is:

"Hello\"
World
" "
Hello
Universe
Hi

EDIT: I'm totally open to a non-regex solution. Regex was just the first approach that came to mind.

Answer 1:

If you decide to forgo regex and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote character, then you can use StreamTokenizer to solve this easily:

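public static List<String> tokenize(String s) throws IOException {
    List<String> opts = new ArrayList<String>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(s));
    st.quoteChar('"');  // a \" inside the quotes is unescaped by the tokenizer
    // Caveat: with the default syntax table, sval is null for numeric tokens
    // (TT_NUMBER) and '/' starts a comment, so this suits simple word input only.
    while (st.nextToken() != StreamTokenizer.TT_EOF) {
        opts.add(st.sval);
    }

    return opts;
}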
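
If both quote characters matter, StreamTokenizer can probably be configured for that as well. The following is a minimal sketch, not part of the original answer. It assumes a reset syntax table in which every non-blank character is a word character and both " and ' are quote characters. The class name QuoteTokenizerSketch is hypothetical, and the IOException is wrapped only to keep the call site short.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class QuoteTokenizerSketch {
    static List<String> tokenize(String s) {
        List<String> opts = new ArrayList<String>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();              // drop the defaults (number parsing, '/' comments)
        st.wordChars(33, 255);         // every visible character is part of a word
        st.whitespaceChars(0, ' ');    // blanks and control characters separate tokens
        st.quoteChar('"');             // these two calls override wordChars for " and '
        st.quoteChar('\'');
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                opts.add(st.sval);     // set for both words and quoted strings
            }
        } catch (IOException e) {      // not expected when reading from a String
            throw new UncheckedIOException(e);
        }
        return opts;
    }

    public static void main(String[] args) {
        for (String t : tokenize("\"Hello\\\" World\" 'Hello Universe' Hi")) {
            System.out.println(t);
        }
    }
}
```

Because numbers are no longer parsed, a token such as 42 comes back as an ordinary word. A quoted part directly next to a word (abc"def") is still returned as two separate tokens.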

If you must support both quote characters, here is a naive implementation that should work. One caveat: a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes:

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

    public static List<String> splitSSV(String in) throws IOException {
        List<String> out = new ArrayList<String>();

        StringReader r = new StringReader(in);
        StringBuilder b = new StringBuilder();
        int inQuote = -1;        // the open quote char, or -1 when outside quotes
        boolean escape = false;
        int c;
        // read each character
        while ((c = r.read()) != -1) {
            if (escape) {  // the previous char was a backslash: take this char literally
                b.append((char) c);
                escape = false;
                continue;
            }
            switch (c) {
            case '\\':   // escape char: the next char is taken literally
                escape = true;
                break;
            case '"':
            case '\'':  // quote chars
                if (inQuote == -1) {         // not in a quote
                    inQuote = c;             // now we are
                } else if (inQuote == c) {   // matching close quote
                    inQuote = -1;            // we were in a quote and now we aren't
                } else {
                    b.append((char) c);      // the other quote char is literal inside a quote
                }
                break;
            case ' ':
                if (inQuote != -1) {
                    b.append((char) c);      // inside a quote: the space belongs to the token
                } else if (b.length() > 0) { // outside: end the token, skipping runs of spaces
                    out.add(b.toString());
                    b.setLength(0);
                }
                break;
            default:
                b.append((char) c);  // append all other chars to current token
            }
        }
        if (b.length() > 0) {
            out.add(b.toString()); // add final token to list
        }
        return out;
    }

Answer 2:

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There are probably open-source parsers that can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.

Answer 3:

To recap: you want to split on whitespace, except where the whitespace is enclosed in double quotes, and a double quote preceded by a backslash does not count as a delimiter.

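Step 1: tokenize the input: `/([ \t]+)|(\\")|(")|((?:[^ \t"\\]|\\(?!"))+)/`

TEXT has to stop at a backslash that is followed by a quote. With a plain `[^ \t"]+`, the run `Hello\` would be consumed as TEXT and the quote after it would be read as an ordinary QUOTE.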

This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.

Step 2: build a finite state machine matching and reacting to the tokens:

State: START

SPACE -> return empty string
ESCAPED_QUOTE -> Error (?)
QUOTE -> State := WITHIN_QUOTES
TEXT -> return text

State: WITHIN_QUOTES

SPACE -> add value to accumulator
ESCAPED_QUOTE -> add quote to accumulator
QUOTE -> return and clear accumulator; State := START
TEXT -> add text to accumulator

Step 3: Profit!!
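
The two steps above can be sketched in Java roughly as follows. This is one possible reading of the answer, not a definitive implementation: the class name QuoteStateMachineSketch is hypothetical, "return empty string" for SPACE in START is read as "emit nothing", the "Error (?)" case throws an exception, and an unterminated quote (which the answer does not cover) is also treated as an error. The TEXT alternative stops at a backslash that precedes a quote, so that `\"` is recognised as ESCAPED_QUOTE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteStateMachineSketch {
    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    private static final Pattern TOKEN = Pattern.compile(
        "([ \\t]+)|(\\\\\")|(\")|((?:[^ \\t\"\\\\]|\\\\(?!\"))+)");

    static List<String> split(String input) {
        List<String> result = new ArrayList<String>();
        StringBuilder acc = new StringBuilder();
        boolean withinQuotes = false;              // false = START, true = WITHIN_QUOTES
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (withinQuotes) {
                if (m.group(3) != null) {          // QUOTE: return and clear accumulator
                    result.add(acc.toString());
                    acc.setLength(0);
                    withinQuotes = false;
                } else if (m.group(2) != null) {   // ESCAPED_QUOTE: add a bare quote
                    acc.append('"');
                } else {                           // SPACE or TEXT: add as-is
                    acc.append(m.group());
                }
            } else {
                if (m.group(3) != null) {          // QUOTE: enter WITHIN_QUOTES
                    withinQuotes = true;
                } else if (m.group(2) != null) {   // ESCAPED_QUOTE: the "Error (?)" case
                    throw new IllegalArgumentException(
                        "escaped quote outside quotes at index " + m.start());
                } else if (m.group(4) != null) {   // TEXT: return it as a token
                    result.add(m.group(4));
                }                                  // SPACE: assumed to emit nothing
            }
        }
        if (withinQuotes) {                        // assumption: not covered by the answer
            throw new IllegalArgumentException("unterminated quote");
        }
        return result;
    }

    public static void main(String[] args) {
        for (String t : split("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(t);
        }
    }
}
```

Keeping the state in a single boolean is enough here because the machine has only two states; an enum would be the natural choice if single quotes or escaped backslashes were added.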

Answer 4:

I think if you use a pattern like this:

Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

Then it will give you the desired output. When I ran it with your input data I got this list:

["Hello\" World", "Hello Universe", Hi]

I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?

EDIT

To strip the enclosing quotes, change your opts.add(token); line to:

opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
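
Put together, the pattern and the quote-stripping step might look like the sketch below. Two things here are assumptions rather than part of the answer: it iterates with Matcher.find() instead of the Scanner from the question, and it adds one extra replace, because stripping the enclosing quotes alone still leaves the backslash in front of the escaped quote. The class name LookbehindSplitSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindSplitSketch {
    // A closing quote only counts when it is not preceded by a backslash.
    private static final Pattern TOKEN =
        Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

    static List<String> split(String str) {
        List<String> opts = new ArrayList<String>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            String token = m.group();
            token = token.replaceAll("^\"|\"$|^'|'$", "");  // strip the enclosing quotes
            token = token.replace("\\\"", "\"");             // assumption: also turn \" into "
            opts.add(token);
        }
        return opts;
    }

    public static void main(String[] args) {
        for (String t : split("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(t);
        }
    }
}
```

The lookbehind only inspects one character, so a token that ends in an escaped backslash (for example "a\\") is not closed where a shell would close it.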

Answer 5:

The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).

With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.

In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.

public static void test()
{
  String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
  List<String> commands = parseCommands(str);
  for (String s : commands)
  {
    System.out.println(s);
  }
}

public static List<String> parseCommands(String s)
{
  String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\""  // double-quoted
             + "|'((?:[^'\\\\]++|\\\\.)*+)'"    // single-quoted
             + "|\\S+";                         // not quoted
  Pattern p = Pattern.compile(rgx);
  Matcher m = p.matcher(s);
  List<String> commands = new ArrayList<String>();
  while (m.find())
  {
    String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
               : m.start(2) != -1 ? m.group(2) // strip single-quotes
               : m.group();
    cmd = cmd.replaceAll("\\\\(.)", "$1");  // remove escape characters
    commands.add(cmd);
  }
  return commands;
}

output:

Hello" World
Hello Universe
Hi

This is about as simple as it gets for a regex-based solution--and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
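
One way to catch at least the simplest kind of malformed input is sketched below. It is not part of the answer above: it reuses the same regex, and relies on the observation that the quoted alternatives are tried first, so an unquoted token that begins with a quote character means the quoted alternative failed at that position. The class name StrictCommandParserSketch, the method name parseStrict, and the choice of IllegalArgumentException are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictCommandParserSketch {
    private static final Pattern TOKEN = Pattern.compile(
          "\"((?:[^\"\\\\]++|\\\\.)*+)\""   // double-quoted
        + "|'((?:[^'\\\\]++|\\\\.)*+)'"     // single-quoted
        + "|\\S+");                         // not quoted

    static List<String> parseStrict(String s) {
        List<String> commands = new ArrayList<String>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            String cmd;
            if (m.start(1) != -1) {
                cmd = m.group(1);           // strip double-quotes
            } else if (m.start(2) != -1) {
                cmd = m.group(2);           // strip single-quotes
            } else {
                cmd = m.group();
                char first = cmd.charAt(0);
                // The quoted alternative failed here, which normally means
                // that no closing quote was found for this opening quote.
                if (first == '"' || first == '\'') {
                    throw new IllegalArgumentException(
                        "unbalanced or malformed quote at index " + m.start());
                }
            }
            commands.add(cmd.replaceAll("\\\\(.)", "$1"));  // remove escape characters
        }
        return commands;
    }

    public static void main(String[] args) {
        System.out.println(parseStrict("\"Hello\\\" World\" 'Hello Universe' Hi"));
    }
}
```

This check does not notice a quote that is opened in the middle of an unquoted token (such as ab"cd), so it is a partial safeguard at best.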

Source: https://stackoverflow.com/questions/6030400/regular-expression-troubles-escaped-quotes

Tags: java, regex, unix, command-line