I have a String which I need to split based on spaces and exactly matching quotes.
If the
string = \"It is fun \\\"to write\\\" regular\\\"expressi
You are running into a fundamental limitation of regular expressions here. In general they can't detect recursion, depth, etc.
So in your string:
"It is fun \"to write\" regular\"expression"
Both the space between to and write and the space between \" and regular are inside quote marks. Regex is not able to "count" the number of quotes in a flexible way and take action based on it.
You will need to write your own string parser for this (or use an existing one). Regex can't handle it though.
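To make the "counting" point concrete, here is a minimal sketch of the state a hand-written scanner keeps track of: a single flag that flips on every quote. The class and method names are made up for illustration, and it assumes quotes are never escaped inside the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only.
final class QuoteStateDemo {

    // Returns the indices of the spaces that lie outside any quoted section.
    static List<Integer> spacesOutsideQuotes(String s) {
        List<Integer> result = new ArrayList<>();
        boolean insideQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '"') {
                insideQuote = !insideQuote; // every quote flips the state
            } else if (ch == ' ' && !insideQuote) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regular\"expression";
        // The space between "to" and "write" (index 13) is skipped.
        System.out.println(spacesOutsideQuotes(data)); // prints [2, 5, 9, 20]
    }
}
```

With this flag, the space between to and write is treated as quoted, while the space after the closing quote is not.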
The trick is to use a flexible look ahead to assert that:
I got it into one line, but it's a whopper:
String[] parts = str.split("(\\s+|(?<!\\s)(?=\"))(?=(([^\"]*\"){2})*[^\"]*"
+ (str.matches("(([^\"]*\"){2})*[^\"]*") ? "" : "\"[^\"]*") + "$)");
This correctly splits the example string with or without the trailing quote (whether or not the trailing term includes a space).
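The core of that lookahead is the (([^\"]*\"){2})*[^\"]*$ part, which only matches when an even number of quotes remains up to the end of the string. As a rough sketch, here is that idea on its own, stripped of the handling for an unbalanced trailing quote and for quotes glued to a word. The class and method names are invented for the example, and it assumes the quotes in the input are balanced.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class EvenQuoteLookaheadDemo {

    // Splits on whitespace only when an even number of quotes follows,
    // i.e. when the whitespace is not inside a quoted section.
    static String[] splitOutsideQuotes(String s) {
        return s.split("\\s+(?=(([^\"]*\"){2})*[^\"]*$)");
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regex";
        System.out.println(Arrays.toString(splitOutsideQuotes(data)));
        // prints [It, is, fun, "to write", regex]
    }
}
```

The whopper above adds the extra alternatives needed for input like the example string, where a quote directly follows a word and the last quote is never closed.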
It seems that you just took the regex from this answer, but as you can see, it doesn't use split; it uses the find method from the Matcher class. That answer also takes care of ', while your input shows no sign of it.
So you can simplify this regex by removing the parts that handle ', which makes it look like
[^\\s\"]+|\"([^\"]*)\"
Also, since you want to include " as part of the token, you don't need to put the match between the " characters in a separate group, so get rid of the parentheses in the \"([^\"]*)\" part:
[^\\s\"]+|\"[^\"]*\"
Now all you need to do is add the case where there is no closing " and you reach the end of the string instead. So change this regex to
[^\\s\"]+|\"[^\"]*(\"|$)
After this you can just use Matcher, find all tokens and store them somewhere, let's say in a List.
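As a quick check of why the end-of-string alternative matters, the sketch below runs the pattern with and without it through the same find loop. The class and method names are made up for this comparison; only the two regexes come from the steps above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for illustration only.
final class UnclosedQuoteDemo {

    // Collects every match of the given regex in the input, in order.
    static List<String> tokens(String regex, String data) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(data);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regular\"expression";

        // Without the end-of-string case the unclosed quote is silently dropped.
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*\"", data));
        // prints [It, is, fun, "to write", regular, expression]

        // With (\"|$) the unclosed quote stays attached to its token.
        System.out.println(tokens("[^\\s\"]+|\"[^\"]*(\"|$)", data));
        // prints [It, is, fun, "to write", regular, "expression]
    }
}
```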
Example:
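// Requires java.util.ArrayList, java.util.List,
// java.util.regex.Matcher and java.util.regex.Pattern.
String data = "It is fun \"to write\" regular\"expression";
List<String> matchList = new ArrayList<String>();
// unquoted word | quoted section that ends at a closing quote or at end of input
Pattern regex = Pattern.compile("[^\\s\"]+|\"[^\"]*(\"|$)");
Matcher regexMatcher = regex.matcher(data);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.group());
    matchList.add(regexMatcher.group());
}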
Output:
It
is
fun
"to write"
regular
"expression
A more complex expression that handles this data with split could look like
String data = "It is fun \"to write\" regular \"expression";
for(String s : data.split("(?<!\\G)(?<=\\G[^\"]*(\"[^\"]{0,100000}\")?[^\"]*)((?<=\"(?!\\s))|\\s+|(?=\"))"))
    System.out.println(s);
but this approach is far more complicated than writing your own parser.
Such a parser could look like
// Requires java.util.ArrayList and java.util.List.
public static List<String> parse(String data) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    boolean insideQuote = false;
    char previous = '\0';
    for (char ch : data.toCharArray()) {
        if (ch == ' ' && !insideQuote) {
            // a space outside quotes ends the current token
            if (sb.length() > 0 && previous != '"')
                addTokenAndResetBuilder(sb, tokens);
        } else if (ch == '"') {
            if (insideQuote) {
                // closing quote: keep it and emit the quoted token
                sb.append(ch);
                addTokenAndResetBuilder(sb, tokens);
            } else {
                // opening quote: emit any pending token, then start a new one
                addTokenAndResetBuilder(sb, tokens);
                sb.append(ch);
            }
            insideQuote = !insideQuote;
        } else {
            sb.append(ch);
        }
        previous = ch;
    }
    // emit whatever is left, including an unclosed quoted section
    addTokenAndResetBuilder(sb, tokens);
    return tokens;
}

// Adds the builder's content to the list (if non-empty) and clears the builder.
private static void addTokenAndResetBuilder(StringBuilder sb, List<String> list) {
    if (sb.length() > 0) {
        list.add(sb.toString());
        sb.delete(0, sb.length());
    }
}
Usage
String data = "It is fun \"to write\" regular\"expression\"xxx\"yyy";
for (String s : parse(data))
    System.out.println(s);